How do I run ChainFinder?

ChainFinder is provided as a compiled executable that is compatible with 64-bit Unix systems. To run, it requires the MATLAB Compiler Runtime (MCR) version 8.0 (R2012b), which can be downloaded from MathWorks at http://www.mathworks.com/products/compiler/mcr/index.html

 

To run ChainFinder, first download and unpack the folder from the link above. Then, from within the ChainFinder folder, execute the command:

./run_ChainFinder.sh <MCR_directory>

where <MCR_directory> is the directory where the MCR or MATLAB is installed.
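For scripted runs, the same command can be launched from Python. The following is a minimal sketch, not part of ChainFinder itself: the function names are hypothetical, and it only assumes what the text states, namely that run_ChainFinder.sh is executed from within the ChainFinder folder with the MCR directory as its single argument.

```python
import subprocess
from pathlib import Path


def build_command(mcr_directory: str) -> list[str]:
    """Return the argument list for the launcher script named in the text."""
    return ["./run_ChainFinder.sh", mcr_directory]


def run_chainfinder(chainfinder_dir: str, mcr_directory: str) -> int:
    """Run the launcher from inside the ChainFinder folder; return its exit code."""
    if not Path(mcr_directory).is_dir():
        raise FileNotFoundError(f"MCR directory not found: {mcr_directory}")
    # cwd is set because the command must be executed from within the folder.
    completed = subprocess.run(build_command(mcr_directory), cwd=chainfinder_dir)
    return completed.returncode
```

A call such as run_chainfinder("ChainFinder", "/opt/mcr/v80") would then start an analysis. Both paths here are placeholders for wherever the folder was unpacked and the MCR was installed.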

 

Input files

ChainFinder requires several input files, as detailed below. Examples of each can be found within the folder “sample_data”.

The file “parameters.txt” must be provided and must list values for the parameters below in the following format (one parameter and value per line; see “parameters.txt” in the “ChainFinder” folder for an example):

<Parameter>: <value>

Parameter (default value)
Comments

run_name (new_run)
An identifier for a given analysis

rearrangement_data (sampledata/prostate_rr_sample.txt)
The name of a tab-delimited text file containing information about rearrangement breakpoints for each sample, formatted as described below

copy_number_data (sampledata/prostate_cn_sample.txt)
The name of a tab-delimited text file containing segmented copy number alteration data for each sample, formatted as described below

background_rate_file (sampledata/prostate_rr_sample.txt)
The name of a tab-delimited text file listing rearrangements that will be used to calculate local rates of chromosomal breakage. This can be the same file as specified for <rearrangement_data>, or another file formatted in the same manner

copy_number_type (snp)
Indicates whether segmented copy number profiles were generated from SNP arrays (“snp”) or sequencing data (“seq”)

summarize_genes (true)
Indicates whether or not to create an output file listing genes that are potentially disrupted by chains of rearrangements

mu_window (1000000)
The size of the window in base-pairs for tallying the rearrangements listed in <background_rate_file> to estimate local rates of rearrangements across the genome

gene_test_window (25000)
Genes that fall within this distance (in base-pairs) of a chain breakpoint will be noted in the gene summary output file

array_probes (sampledata/probes_hg19.mat)
A “.mat” (MATLAB) file containing an array called “probes” composed of two columns. The chromosome number and genomic coordinate of each probe are listed in the first and second columns, respectively. The chromosomes and probe coordinates must be sorted in ascending order. X and Y chromosomes are specified as 23 and 24, respectively. Note: the default file is configured for Affymetrix SNP 6.0 arrays mapped to hg19 coordinates

deletion_thresh (-0.1)
Copy number segments with values below this threshold will be considered as deletions

probe_window (8)
Indicates how far from a breakpoint to search for the edge of a deletion segment that may correspond to the breakpoint, in numbers of array probes. Note: this value is only used if <copy_number_type> is set to “snp”

bp_window (5000)
Indicates how far from a breakpoint to search for the edge of a deletion segment that may correspond to the breakpoint, in base-pairs. Note: this value is only used if <copy_number_type> is set to “seq”

significance_thresh (0.05)
The Benjamini-Hochberg-corrected q-value at which deviation from the independent model of rearrangements will be considered significant

genome_size (2846426791)
Base-pairs in the reference genome build

gene_table (gene_table_hg19.txt)
A text file containing the genomic coordinates of genes for annotation of output files (required only if <summarize_genes> is set to “true”)

test_distance_thresh (1000000)
Breakpoints within this reference genome distance will be tested for significant adjacency

create_circos_file (true)
Indicates whether ChainFinder should generate a “.conf” file for displaying rearrangements on a Circos plot
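To illustrate the “<Parameter>: <value>” layout, here is a minimal Python sketch that renders and re-reads such lines. The helper names are hypothetical, the values are the defaults listed above, and only a subset of parameters is shown for brevity. Whether ChainFinder accepts a partial file or blank lines is not stated here, so a real “parameters.txt” should follow the example shipped in the “ChainFinder” folder.

```python
def format_parameters(params: dict) -> str:
    """Render parameters as one '<Parameter>: <value>' line each."""
    return "".join(f"{name}: {value}\n" for name, value in params.items())


def parse_parameters(text: str) -> dict:
    """Parse '<Parameter>: <value>' lines back into a dict of strings."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines (an assumption of this sketch)
        # Split on the first colon only, so values may themselves contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected '<Parameter>: <value>', got: {line!r}")
        params[name.strip()] = value.strip()
    return params


# Default values copied from the parameter list; only a subset is shown.
example = {
    "run_name": "new_run",
    "rearrangement_data": "sampledata/prostate_rr_sample.txt",
    "copy_number_data": "sampledata/prostate_cn_sample.txt",
    "background_rate_file": "sampledata/prostate_rr_sample.txt",
    "copy_number_type": "snp",
    "deletion_thresh": -0.1,
    "significance_thresh": 0.05,
}

text = format_parameters(example)
print(text, end="")
```

Parsing the rendered text returns every value as a string, so numeric parameters such as deletion_thresh would need to be converted before use.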

 

 

 

The input files listed in “parameters.txt” must be provided as tab-delimited text files in the following format. The columns listed below are required for each input file (please see the files in the “sample_data” folder for examples). The indicated header must be listed at the top of each column:

rearrangement_data:

Header
Values

sample
A unique name for each sample (must be consistent across all input files that refer to the sample and may not contain spaces)

num
A number to identify each rearrangement

chr1
Chromosome of the first breakpoint in the fusion

pos1
Base-pair coordinate of the first breakpoint in the fusion

str1
The strand direction of the first breakpoint (0 for forward, 1 for reverse)

chr2
Chromosome of the second breakpoint in the fusion

pos2
Base-pair coordinate of the second breakpoint in the fusion

str2
The strand direction of the second breakpoint (0 for forward, 1 for reverse)

site1 (optional)
An optional description of the genomic context of the first breakpoint (e.g., nearby genes)

site2 (optional)
An optional description of the genomic context of the second breakpoint (e.g., nearby genes)
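Before starting a run, it can help to check a rearrangement file against the required columns listed above. The following Python sketch is one way to do that. It is not part of ChainFinder, the function name is hypothetical, and the rule that positions must be whole numbers is an assumption of this sketch. The demo rows are invented and are not taken from the sample data.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "num", "chr1", "pos1", "str1", "chr2", "pos2", "str2"]


def check_rearrangements(handle) -> list[str]:
    """Return a list of problems found in a tab-delimited rearrangement file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        # Short rows yield None for absent fields, so fall back to "".
        value = {col: (row.get(col) or "") for col in REQUIRED_COLUMNS}
        if " " in value["sample"]:
            problems.append(f"line {line_number}: sample name contains a space")
        for col in ("str1", "str2"):
            if value[col] not in ("0", "1"):
                problems.append(f"line {line_number}: {col} must be 0 or 1")
        for col in ("pos1", "pos2"):
            if not value[col].isdecimal():
                problems.append(f"line {line_number}: {col} is not a whole number")
    return problems


# Invented two-row example; the second row has an invalid strand value.
demo = (
    "sample\tnum\tchr1\tpos1\tstr1\tchr2\tpos2\tstr2\n"
    "S1\t1\t1\t1000\t0\t2\t5000\t1\n"
    "S1\t2\t3\t7000\t+\t3\t9000\t0\n"
)
for problem in check_rearrangements(io.StringIO(demo)):
    print(problem)
```

For a file on disk, the same function can be given an open file handle, for example open(path, newline="").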

 

 

background_rate_file:

Header
Values

sample
A unique name for each sample (must be consistent across all input files that refer to the sample and may not contain spaces)

chr1
Chromosome of the first breakpoint in the fusion

pos1
Base-pair coordinate of the first breakpoint in the fusion

chr2
Chromosome of the second breakpoint in the fusion

pos2
Base-pair coordinate of the second breakpoint in the fusion
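To give a feel for how the background file relates to the mu_window parameter, the sketch below tallies breakpoints in fixed, non-overlapping windows along each chromosome. This is only an illustration under stated assumptions: the binning scheme and the function name are hypothetical, and ChainFinder's actual rate calculation may differ (for example, it may use sliding windows or additional normalization). The demo rows are invented.

```python
from collections import Counter


def tally_breakpoints(rearrangements, window=1_000_000):
    """Count breakpoints per (chromosome, window index) bin.

    Each rearrangement contributes two breakpoints: (chr1, pos1) and (chr2, pos2).
    """
    counts = Counter()
    for rr in rearrangements:
        for chrom, pos in ((rr["chr1"], rr["pos1"]), (rr["chr2"], rr["pos2"])):
            # Integer division assigns the position to its window.
            counts[(chrom, int(pos) // window)] += 1
    return counts


# Invented rows with the columns required for the background file.
rows = [
    {"sample": "S1", "chr1": "1", "pos1": "1500000", "chr2": "1", "pos2": "1999999"},
    {"sample": "S2", "chr1": "1", "pos1": "2000000", "chr2": "5", "pos2": "10"},
]
for (chrom, index), n in sorted(tally_breakpoints(rows).items()):
    print(chrom, index, n)
```

Windows with higher counts would correspond to regions where rearrangements are more frequent in the background set.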

 

 

copy_number_data:

Header
Values

sample
A unique name for each sample (must be consistent across all input files that refer to the sample and may not contain spaces)

chr
Chromosome

start
Base-pair coordinate of copy number segment start

end
Base-pair coordinate of copy number segment end

segment_mean
Amplitude of copy number segment (e.g., log2 ratio)

num_probes
If <copy_number_type> is set to “snp”, this indicates the number of array probes contained within the copy number segment
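The deletion_thresh and bp_window parameters can be illustrated with copy number segments in this format. The sketch below flags segments whose segment_mean is below the threshold and then looks for deletion edges within bp_window of a breakpoint, as described for “seq” data. It is a simplified illustration with hypothetical function names, it assumes the segments have already been restricted to one sample, and it does not cover the probe-based search used for “snp” data. The segments are invented.

```python
def deletion_segments(segments, deletion_thresh=-0.1):
    """Return the segments whose segment_mean falls below the threshold."""
    return [seg for seg in segments if float(seg["segment_mean"]) < deletion_thresh]


def matching_deletion_edges(chrom, pos, segments, deletion_thresh=-0.1, bp_window=5000):
    """List (segment, edge) pairs whose start or end lies within bp_window of pos."""
    matches = []
    for seg in deletion_segments(segments, deletion_thresh):
        if str(seg["chr"]) != str(chrom):
            continue  # only segments on the breakpoint's chromosome can match
        for edge in ("start", "end"):
            if abs(int(seg[edge]) - pos) <= bp_window:
                matches.append((seg, edge))
    return matches


# Invented segments for a single sample.
segs = [
    {"sample": "S1", "chr": "1", "start": "1000", "end": "50000", "segment_mean": "-0.8"},
    {"sample": "S1", "chr": "1", "start": "50001", "end": "90000", "segment_mean": "0.02"},
    {"sample": "S1", "chr": "2", "start": "1000", "end": "50000", "segment_mean": "-0.1"},
]
for seg, edge in matching_deletion_edges("1", 52000, segs):
    print(seg["chr"], seg[edge], edge)
```

Note that a segment with a mean exactly equal to the threshold is not flagged, since the text specifies values below the threshold.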

 

gene_table:

Header
Values

gene
Gene name

chr
Chromosome

gene_start
Base-pair coordinate of gene start

gene_end
Base-pair coordinate of gene end
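The role of gene_test_window can be sketched with a gene table in this format: a gene is reported when a breakpoint lies inside it or within the window on either side. The function name is hypothetical, the gene names and coordinates are invented, and ChainFinder's own gene annotation may apply further rules that are not described here.

```python
def genes_near_breakpoint(chrom, pos, genes, gene_test_window=25_000):
    """Return names of genes within gene_test_window base-pairs of a breakpoint."""
    hits = []
    for gene in genes:
        if str(gene["chr"]) != str(chrom):
            continue
        # Sort the coordinates so the check works even if start > end.
        start, end = sorted((int(gene["gene_start"]), int(gene["gene_end"])))
        # Distance is zero when the breakpoint lies inside the gene.
        distance = max(start - pos, pos - end, 0)
        if distance <= gene_test_window:
            hits.append(gene["gene"])
    return hits


# Invented gene records.
genes = [
    {"gene": "GENE_A", "chr": "1", "gene_start": "100000", "gene_end": "120000"},
    {"gene": "GENE_B", "chr": "1", "gene_start": "200000", "gene_end": "210000"},
    {"gene": "GENE_C", "chr": "2", "gene_start": "100000", "gene_end": "120000"},
]
print(genes_near_breakpoint("1", 130000, genes))
```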

 

Outputs:

Output file
Description

Chain_summary_<run_name>.txt
Summarizes rearrangement chain metrics for each sample and each chain

<sample>_chain_genes.txt
Summarizes genes that are potentially deleted in the context of a chain or are within <gene_test_window> of a chained breakpoint (one file is created for each sample)

<sample>_chains_final.txt
Annotated list of all rearrangements assigned to chains for a given sample

<sample>_chains_long.txt
Detailed output documenting the calculations performed by ChainFinder for a given sample

<sample>_chain_circos.conf, <sample>_chain_<#>.links, <sample>_cn.txt
Files created within the “Circos” folder that can be used as inputs to Circos to plot rearrangement chains coded by color