Third-Party Tools
Tools built on top of the GATK, developed by other groups


Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovering and genotyping structural variations using sequencing data. The methods are designed to detect shared variation using data from multiple individuals, but can also process single genomes. Please see the Genome STRiP website for more information.

You are welcome to ask questions and report problems about Genome STRiP in this category of the forum.


1. Introduction

Genome STRiP makes use of mask files that identify portions of the reference sequence that are not reliably alignable.

Genome mask files are fasta files with the same number of sequences and of the same length as the reference sequence. In a genome mask file, a base position is marked with a 0 if it is reliably alignable and 1 if it is not. Each genome mask file is specific to the reference sequence and to the parameters used to determine alignability.

The current generation of mask files is based on fixed read lengths. A base is assigned a 0 if an N-base sequence centered on that base is unique within the reference genome. You should use a genome mask with a value of N that corresponds to the read length of your input data set. For example, if your data is a uniform set of Illumina paired-end data with 101bp reads, then you should use (or generate) a genome mask with a read length of 101. If your data is a mixture of read lengths, one viable strategy is a "lowest common denominator" approach: use a mask length corresponding to the shortest reads in your input data set. Using the smallest read length will mark a small additional fraction of the genome as inaccessible, but will give the best specificity. Alternatively, you can use a larger N, which should modestly improve sensitivity at the cost of a modest increase in false discovery rate and a modest decrease in genotyping accuracy.
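
To make the definition concrete, the toy bash sketch below marks a base 0 when the N-mer centered on it occurs exactly once in a short made-up sequence. This is only an illustration of the uniqueness criterion, not the real ComputeGenomeMask, which aligns windows with BWA and considers both strands:

```shell
# Toy illustration of the mask definition (NOT the real ComputeGenomeMask):
# mark a base 0 if the N-mer centered on it occurs exactly once in the
# forward strand of the sequence, else 1.
seq="ACGTACGTCCA"   # made-up toy sequence
N=3                 # window (read) length
len=${#seq}
half=$(( (N - 1) / 2 ))
mask=""
for (( i = 0; i < len; i++ )); do
    start=$(( i - half ))
    if (( start < 0 || start + N > len )); then
        mask+="1"               # window runs off the sequence end
        continue
    fi
    kmer=${seq:start:N}
    count=0                     # occurrences of this window in the sequence
    for (( j = 0; j + N <= len; j++ )); do
        if [[ ${seq:j:N} == "$kmer" ]]; then
            count=$(( count + 1 ))
        fi
    done
    if (( count == 1 )); then mask+="0"; else mask+="1"; fi
done
echo "$mask"
```

Bases near the sequence ends and bases whose window recurs elsewhere come out as 1 (unalignable), everything else as 0.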

2. Resources

Some precomputed mask files for a variety of reference sequences and read lengths are available at ftp://ftp.broadinstitute.org/pub/svtoolkit/svmasks.

3. Generating your own genome mask

The ComputeGenomeMask command line utility is available to generate genome mask files, but Queue scripts to automate the process have not been written. A reasonable strategy is to compute the genome mask in parallel, chromosome by chromosome, and then merge the resulting fasta files into a final genome-wide mask file.

4. Planned Enhancements

The implementation of mask files will be replaced in a future release.

Mask files are being converted from textual fasta files to binary files and are being enhanced to better support input data sets with multiple read lengths (so the use of a "lowest common denominator" strategy will no longer be necessary).


1. Introduction

Genome STRiP is built out of GATK Walkers and some miscellaneous command line utilities.

If you need to implement a specialized pipeline, you can use these modules directly, using the standard Queue pipelines as a guide. The standard Queue pipelines also use Samtools, BWA and some Picard utilities.

2. GATK Walkers

ComputeInsertSizeDistributions

ComputeReadDepthCoverage

ComputeReadSpanCoverage

SVAltAlignerWalker

SVDiscoveryWalker

SVGenotyperWalker

3. Utilities

ComputeGenomeMask

ComputeGenomeMaskStatistics

ComputeInsertStatistics

GenerateAltAlleleFasta

MergeDiscoveryOutput

MergeGenotyperOutput

MergeInsertSizeDistributions

MergeReadDepthCoverage

MergeReadSpanCoverage

4. Pre-Release

This section documents various utilities and walkers that are not yet available in the production release. These utilities may be available in some of the interim releases (build snapshots) downloadable from our website http://www.broadinstitute.org/software/genomestrip.

PlotGenotypingResults

PlotInsertSizeDistributions


1. Introduction

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.

Genome STRiP uses the Genome Analysis Toolkit (GATK). There are pre-defined Queue pipelines to simplify running analyses.

The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.

Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.

To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set. Please report bugs through svtoolkit-help@lists.sourceforge.net.

Before posting, please review the FAQ.

2. Structure

Genome STRiP consists of a number of related modules that are run as a pipeline.

To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyping. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.

3. Inputs and Outputs

Genome STRiP requires aligned sequence data in BAM format.

The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.

Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.

4. Downloading and Installation

Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.

To install, download the tarball and decompress into a suitable directory. You will need to install pre-requisite software as described below. There is a 10-minute installation/verification test in the installtest subdirectory. You will also need to download (or build) a suitable Genome Mask File.

The test scripts also serve as example pipelines for running Genome STRiP.

Environment Variables

Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.

5. Dependencies

Java

Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar). You will need Java 1.7.

GATK

Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.

The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.

Picard

The Genome STRiP pipelines use some Picard standalone command line utilities. You will need to install these separately. URL: http://picard.sourceforge.net

Samtools

The pipelines use 'samtools index' to index BAM files. You will need to install samtools separately. URL: http://samtools.sourceforge.net

This dependency on samtools could in theory be replaced with Picard 'BuildBAMIndex', if you can't run samtools for some reason.

BWA

Several pipeline functions use BWA (the executable) and also use BWA through its C API. You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net

A pre-built Linux shared library, libbwa.so, that is required by Genome STRiP comes with the SVToolkit distribution. This library is built from the BWA source code and from source code that is part of the GATK.

The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, you can try running with the pre-built version of bwa included in the distribution that was built from the same version as the shared library.

R

Genome STRiP uses some R scripts internally.

To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.

Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.

6. Running Genome STRiP

Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.

A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible scala-based system for writing processing pipelines that can be distributed on compute farms. These pipeline scripts should be taken as example templates and they may need to be modified for your specific analysis.

Each processing step has a corresponding Queue pipeline script:

SVPreprocess

Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.

SVAltAlign

Re-align reads from input BAM files to alternative alleles described in an input VCF file.

SVDiscovery

Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.

SVGenotyper

Genotype a set of polymorphic structural variation loci described in a VCF file.

Genome STRiP Functions

The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.

7. Support

We have set up a mailing list for bug reports and questions at svtoolkit-help@lists.sourceforge.net.

You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.

Please also consult the FAQ.

Note that we are currently not distributing software through sourceforge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.


1. Introduction

The SuperArray annotator is invoked through the SVVariantAnnotator walker, which defines arguments common to all annotators.

The SuperArray annotator uses array intensity data to do a form of in silico validation of copy number variants.

This annotator requires that each variant indicate (through an INFO tag) the set of samples that are thought to carry the variant (either as homozygotes or heterozygotes). A Wilcoxon rank sum test is performed and a p-value calculated as follows: For each array probe underlying the variant, each sample is assigned an integral rank for that probe. Then the set of ranks (across all probes) is combined and treated as a set of observations for the Wilcoxon rank sum test. If there is more than one probe, there will certainly be ties (i.e. some sample will be rank 1 with respect to each probe). Ties are broken randomly to assign the final ranking.

A Wilcoxon rank sum test is then used to test whether the event-carrying samples (as indicated by the INFO tag) are shifted with respect to the non-event-carrying samples (for deletions, this is a one-tailed test of a negative shift).
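
The rank-sum idea can be sketched for a single probe with no ties. The sample names and intensity values below are made up for illustration; the real annotator performs the test in R, combining ranks across all underlying probes:

```shell
# Toy sketch of the rank-sum statistic: rank samples by intensity and sum
# the ranks of the carrier samples. Intensities here are invented; a low
# rank sum means carriers are shifted toward lower intensity, as expected
# for deletion carriers.
W=$(printf 'NA12878\t0.31\nNA12891\t0.28\nNA12892\t0.35\nNA19238\t1.02\nNA19239\t0.99\n' \
    | sort -k2,2n \
    | awk -v carriers="NA12878,NA12891,NA12892" '
        BEGIN { n = split(carriers, c, ","); for (i = 1; i <= n; i++) set[c[i]] = 1 }
        { rank++; if ($1 in set) W += rank }
        END { print W }')
echo "carrier rank sum W = $W"
```

A rank sum well below its null expectation indicates the carriers' intensities are shifted downward, which is the one-tailed alternative tested for deletions.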

2. Inputs / Arguments

  • -arrayIntensityFile <data-file> : The path to an input file containing a matrix of array intensity values. : The file must be tab delimited with a header line. Each line of the file contains data for one probe. The first four columns should be named ID, CHR, START and END. ID is an identifier for the probe and the other three columns give the 1-relative coordinates of the probe (START and END can be equal). Columns beyond the first four provide intensity data for each sample. The header for each additional column should contain the sample ID.

  • -superArraySampleTag <tag-name> : This is the name of the INFO field that contains the list of carrier samples, which must be comma-separated. The default tag name is SASAMPLES. For example, SASAMPLES=NA12878,NA12891,NA12892 would indicate that three samples carry the variant.

  • -sample <sample-or-sample-list> : A subset of samples on which to perform the test (or a .list file of sample identifiers). The default behavior is to use all samples in the array intensity file.

  • -superArrayPermute <true/false> : If set to true, then the sample identities are permuted before performing the test to generate a null distribution.

3. Annotations

This annotator produces three INFO field annotations for each VCF record:

  • SANSAMPLES : The number of carrier samples.

  • SANPROBES : The number of probes underlying the event.

  • SAPVALUE : The calculated p-value.

The annotator can also generate a tab-delimited report file containing these annotations.

4. Example

The SuperArray annotator requires Genome STRiP and R.

export SV_DIR=/path/to/SVToolkit/root/directory

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.gatk.CommandLineGATK \
    -T SVVariantAnnotator \
    -A SuperArray \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -O output.vcf \
    -arrayIntensityFile Omni25_superarray_intensity_matrix.dat \
    -sample discovery_samples.list \
    -superArraySampleTag SAMPLES \
    -writeReport \
    -reportFile superarray_output.dat

5. Performance

The SuperArray annotator uses R as well as java and can consume up to 10G of memory.

The SuperArray annotator uses an exact test in many cases, which can be expensive. If you want to test more than a few hundred variants, you should consider splitting up the input VCF file and processing the pieces in parallel. In sample runs, testing 1000 variants in 1000 samples can take 2 to 3 hours.


1. Introduction

The VariantsPerSample annotator is invoked through the SVVariantAnnotator walker, which defines arguments common to all annotators.

The VariantsPerSample annotator is a simple annotator that counts how many variants are in each sampled genome. The output can be rolled up into population level statistics.

In the current implementation, VariantsPerSample uses the GSSAMPLES INFO tag in the input VCF file to determine which samples carry the variant. The input VCF file should not contain genotypes.

Filtered variants are not counted in the totals.

2. Inputs / Arguments

  • -populationMap <map-file> : A tab-delimited input file containing two columns: the sample ID and a population ID for that sample. If supplied, the population information will be carried over into the output report.
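
The expected layout of this file can be sketched as follows; the sample-to-population assignments below are illustrative, and the file name matches the one used in the example invocation:

```shell
# Create a minimal population map: tab-delimited, one line per sample,
# sample ID in column 1 and its population ID in column 2.
# These sample/population pairings are for illustration only.
printf 'NA12878\tCEU\nNA12891\tCEU\nNA19238\tYRI\n' > sample_to_population.map
cat sample_to_population.map
```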

3. Annotations

No VCF annotations are produced, but this annotator is used to produce a report file. The report file will contain one line per sample. The report includes the number of variants and also the population if -populationMap is supplied.

4. Example

java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.gatk.CommandLineGATK \
    -T SVVariantAnnotator \
    -A VariantsPerSample \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -populationMap sample_to_population.map \
    -writeReport \
    -reportFile variants_per_sample.dat

1. Introduction

The ComputeGenomeMask utility determines the alignability of each base in the reference genome.

Mask files are generated based on a fixed read length L. A base is considered alignable if a window of length L centered on the base is unique within the reference sequence.

The ComputeGenomeMask utility works by extracting a sequence of length L at every base position, aligning these sequences back to the reference genome (using BWA), and checking for unique matches.

The implementation of genome masking in Genome STRiP is likely to change in the future.

2. Inputs / Arguments

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence to mask. : The fasta file must have been indexed using BWA in preparation for BWA alignment. The fasta file must also be indexed with 'samtools faidx' or the equivalent.

  • -readLength <length> : The size of the window to use for determining alignability.

  • -sequence <sequenceName> : The name of the sequence to process (default is to process the entire genome). This can be used to parallelize the computation by chromosome.

3. Outputs

  • -O <mask-file> : Output genome mask file (fasta format). Default is to write to stdout. The mask file contains '0' at alignable positions and '1' at non-alignable positions.

4. Running

ComputeGenomeMask is part of Genome STRiP, which is a third-party GATK library.

An example invocation is shown below:

export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.ComputeGenomeMask \
    -R Homo_sapiens_assembly18.fasta \
    -O Homo_sapiens_assembly18.mask.chr1.36.fasta \
    -readLength 36 \
    -sequence chr1

To generate a mask for the entire genome, it is generally preferable to compute the mask for each chromosome separately and then concatenate the output files in the correct order.

The following script provides an example of generating the appropriate indexes and running ComputeGenomeMask in parallel on each chromosome. This is an example and not a general purpose script; it will likely need to be modified for your environment. After each chromosome has been run, the resulting files must be concatenated together in the same order as the reference sequence to create the final mask.

#!/bin/bash

outdir=example
readLength=75
reference=Homo_sapiens_hg19.fasta
export SV_DIR=/humgen/cnp04/bobh/svtoolkit/stable

# These executables must be on your path.
which java > /dev/null || exit 1
which bwa > /dev/null || exit 1
# The directory containing libbwa.so must be on your LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar"

mkdir -p ${outdir}/work

localReference=${outdir}/work/`echo ${reference} | awk -F / '{ print $NF }'`
if [ ! -e ${localReference} ]; then
    ln ${reference} ${localReference} || exit 1
fi

java -cp ${classpath} -Xmx4g \
    org.broadinstitute.sv.apps.IndexFastaFile \
    -I ${localReference} \
    -O ${localReference}.fai \
    || exit 1

bwa index -a bwtsw ${localReference} || exit 1

chroms=`cat ${localReference}.fai | cut -f 1`
for chr in ${chroms}; do
    bsub -o ${outdir}/work/svmask_${chr}.log \
        -R "rusage[mem=5000]" \
        java -cp ${classpath} -Xmx4g \
        org.broadinstitute.sv.apps.ComputeGenomeMask \
        -R ${localReference} \
        -O ${outdir}/work/svmask_${chr}.fasta \
        -readLength ${readLength} \
        -sequence ${chr} \
        || exit 1
done
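
The final concatenation step can be sketched as below. The toy .fai contents and per-chromosome mask files stand in for real output; the svmask_<chr>.fasta naming follows the script above. The key point is that column 1 of the reference .fai gives the required sequence order:

```shell
# Merge per-chromosome mask fastas in the same order as the reference
# .fai index. The toy files created here are illustrative stand-ins.
workdir=$(mktemp -d)
printf 'chr1\t10\t6\t10\t11\nchr2\t8\t24\t8\t9\n' > ${workdir}/ref.fasta.fai
printf '>chr1\n0000111100\n' > ${workdir}/svmask_chr1.fasta
printf '>chr2\n11000011\n'   > ${workdir}/svmask_chr2.fasta
# column 1 of the .fai lists the sequences in reference order
cut -f1 ${workdir}/ref.fasta.fai | while read chr; do
    cat ${workdir}/svmask_${chr}.fasta
done > ${workdir}/svmask.fasta
head -1 ${workdir}/svmask.fasta
```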

1. Introduction

The ComputeGenomeMaskStatistics utility prints the number of alignable and unalignable bases.

See ComputeGenomeMask.

2. Inputs / Arguments

  • -I <mask-file> : The genome mask file. An indexed fasta file containing the genome mask. The fasta file must be indexed with samtools faidx or the equivalent.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence to mask. The fasta file must be indexed with samtools faidx or the equivalent.

3. Outputs

  • -O <output-file> : Tab-delimited output file containing the number of alignable and unalignable bases (per chromosome and in total). Default is to write to stdout.

1. Introduction

The ComputeInsertSizeDistributions walker traverses a set of BAM files to generate histograms of insert sizes.

The insert size histograms are stored in a binary file format. Many histograms can be stored in the same file. The histograms are identified by <Sample, Library, ReadGroup> triples. The trailing components can be null. For example, if histograms are computed library-by-library (the default), then the ReadGroup in each triple will be null.

See also MergeInsertSizeDistributions, ComputeInsertStatistics.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory. Currently only used to check for a default list of excluded read groups.

  • -overwrite : If true (the default), overwrite the output file, otherwise append.

  • -createEmpty : If true, create a zero length output file if there are no paired reads in the input (default false).

3. Outputs

  • -O <histogram-file> : Location of the output binary histogram file.

1. Introduction

The ComputeInsertStatistics utility prints statistics about the data in a binary histogram file.

The printed metrics include the median insert size and robust standard deviation (RSD). The RSD is the width of the central 68% of the insert size distribution.

2. Inputs / Arguments

  • -I <histogram-file> : The input binary histogram file.

3. Outputs

  • -O <output-file> : Tab-delimited output file (default is stdout).

1. Introduction

The ComputeReadDepthCoverage walker traverses a set of BAM files to generate genome-wide read depth statistics.

The read depth coverage is the number of fragments confidently aligning to the genome (or interval if an interval subset is used). The read depth coverage is measured only at confidently alignable bases (based on the genome mask) and only using reads passing the mapping quality filter. Reads are counted as aligning at the middle aligned base. Sequenced molecules are counted only once: for paired-end data, only the leftmost end of the pair (the end with the lowest reference coordinate) is counted.

Read depth coverage is computed and reported for each readgroup, but the output is keyed by sample and library to allow easy roll up.

See also MergeReadDepthCoverage.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -minMapQ <quality-threshold> : Reads below this mapping quality are not counted.

  • -excludedReadGroups <file> : Optional file containing a list of read groups to ignore (one read group per line).

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

3. Outputs

  • -O <coverage-file> : Tab delimited output file (default is stdout).

1. Introduction

The ComputeReadSpanCoverage walker traverses a set of BAM files to generate genome-wide statistics.

The read span coverage is the count of bases in between two paired-end reads, not counting the lengths of the reads themselves. For fixed-length reads of length L with ungapped alignments, this would be InsertSize - 2*L. The read span coverage is used as an estimate of the power for detecting breakpoints using read pairs. This estimate assumes a model where the aligner is unlikely to align a read to a breakpoint unless the breakpoint is close to the end of the read.
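
For example, with illustrative values (not from any real data set), the span contributed by a single pair works out as:

```shell
# Illustrative only: a 350bp insert sequenced with 101bp reads
# contributes InsertSize - 2*L = 350 - 2*101 bases of span coverage.
insert_size=350
read_length=101
span=$(( insert_size - 2 * read_length ))
echo "span coverage contribution: ${span}bp"
```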

Read pairs where the ends align to different sequences are never counted.

Read span coverage is computed and reported for each readgroup, but the output is keyed by sample and library to allow easy roll up.

See also MergeReadSpanCoverage.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory. Insert size histograms are loaded from the default isd.hist.bin file in this directory. This argument is also used to load a default list of excluded read groups.

  • -maxInsertSize <n> : Read pairs with an insert size greater than n are not counted in span coverage.

  • -maxInsertSizeStandardDeviations <sd> : Read pairs with an insert size greater than the median plus sd robust standard deviations are not counted in span coverage.

3. Outputs

  • -O <span-coverage-file> : Tab delimited output file (default is stdout).

1. What does error message X mean?

See the FAQ section on frequently encountered errors.

2. Can I use Genome STRiP to do discovery or genotyping in a single high-coverage individual?

Genome STRiP is designed to discover and genotype variants in populations and uses the information from multiple individuals simultaneously. Typically you will need data from at least 20 or 30 individuals to get good results.

That being said, it may be possible to use a "background population" along with a single high-coverage individual to run Genome STRiP. The background population does not need to have the same depth of coverage as the target genome you want to process, but reads will need to be aligned to the same reference sequence. A good background population might be 50 or so individuals from the 1000 Genomes Project chosen from diverse population groups. This approach has not been widely tested, although I have looked at targeted resequencing loci using this strategy with some success. If you try this strategy, please share your experiences.

3. Does Genome STRiP only work with deletions?

In the current version, only deletions (relative to the reference) are supported in discovery and genotyping. We are actively working on discovery and genotyping of other kinds of structural variants.

4. Is the source code available?

Not at this time, but we are planning to release the source code shortly.

5. Can I run discovery on a small genomic region?

If you have whole-genome sequence data, you can run on just a small region using the standard -L argument to the GATK. For example -L chr1:1000000-2000000.

If you have targeted resequencing data, where you have only sequenced a small subset of the genome, then you additionally need to set the effective genome size to be smaller. To do this, you currently need to modify the configuration parameters in conf/genstrip_parameters.txt (the file location is specified with the -configFile command line argument).

You will need to change these three parameters:

input.genomeSize = A + X + Y 
input.genomeSizeMale = 2*A + X + Y 
input.genomeSizeFemale = 2*A + 2*X

where A is the total size of the autosomal reference and X and Y are the lengths of the X and Y chromosomes. Note that genomeSize is in haploid bases while genomeSizeMale and genomeSizeFemale are in diploid bases.

Of course, if your target region doesn't include X or Y, then just set genomeSizeMale and genomeSizeFemale to 2*genomeSize. See the installtest configuration file for an example, where the effective genome size is set to 200Kb.
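
As a worked example with made-up region sizes, the three parameters relate as follows:

```shell
# Made-up targeted design: A autosomal bases, X bases on chrX, Y on chrY.
A=150000
X=40000
Y=10000
genomeSize=$(( A + X + Y ))              # haploid bases
genomeSizeMale=$(( 2 * A + X + Y ))      # diploid bases (XY)
genomeSizeFemale=$(( 2 * A + 2 * X ))    # diploid bases (XX)
echo "$genomeSize $genomeSizeMale $genomeSizeFemale"
```

With X and Y both zero, the male and female values collapse to 2*genomeSize, matching the simplification described above.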


1. SVGenotyper ERROR MESSAGE: Please invoke this walker with -BTI

This can be caused by an incorrect invocation of the command or a malformed input file.

The SVGenotyper walker requires the GATK -BTI argument, specifying the ROD which contains the input VCF file. Typically something like this:

-BTI input 
-B:input,VCF my_input_file.vcf

Another problem that can cause this same symptom is a malformed VCF input file. For example, if you have spaces instead of tabs in the VCF file, the GATK can fail to parse the VCF file and it can cause this error.

2. SVGenotyper ERROR MESSAGE: net.sf.picard.PicardException: Malformed query; start point 59152350 lies after end point 59128983

This can be caused by a malformed VCF file where the POS field is larger than the length of the chromosome. The "end point" value is the length of the chromosome.


1. Introduction

The GenerateAltAlleleFasta utility processes a VCF file to extract the sequences of the alternate alleles.

For each structural variation record in the VCF, this utility will generate one output sequence in fasta format for each alternative allele that has precise breakpoints. The identifier for the alternate allele will be variantID_alleleNumber where alleleNumber is the number of the allele in the ALT column of the VCF file (the first ALT allele is allele 1).

The remainder of each fasta header line after the ID contains an encoded description of how the allele sequence maps back to the reference genome. The naming convention for the fasta sequences and the format of the rest of the header line is understood by other programs that use the alternate allele fasta file as input.

Here is an example of a generated fasta header:

>P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200|R:chr20:51913736-51913935;202-401|LENGTH:401

This example is for the first alternate allele of a variant with ID P2_M_061510_20_81. The length of the generated fasta sequence is 401 bases. Bases 1-200 of the alternate allele sequence align to chr20:51913435-51913634 of the reference sequence and bases 202-401 align to chr20:51913736-51913935. Thus, this event represents a deletion of 101bp of the reference (chr20:51913635-51913735) with one base of non-template sequence present in the alternate allele.
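
The deleted interval can be recovered mechanically from the header's L and R mappings. The sed-based parsing below is an illustrative sketch of the header encoding, not a supported tool:

```shell
# Decode the example header above: the deletion spans the reference bases
# between the end of the L mapping and the start of the R mapping.
header='P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200|R:chr20:51913736-51913935;202-401|LENGTH:401'
left_end=$(echo "$header" | sed 's/.*L:[^:]*:[0-9]*-\([0-9]*\);.*/\1/')
right_start=$(echo "$header" | sed 's/.*R:[^:]*:\([0-9]*\)-.*/\1/')
del_len=$(( right_start - left_end - 1 ))
echo "deleted interval: chr20:$(( left_end + 1 ))-$(( right_start - 1 )) (${del_len}bp)"
```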

2. Inputs / Arguments

  • -I <vcf-file> : The input VCF file.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -flankLength <N> : The number of reference bases to include around each alternate allele (default 200). The flank length is counted outside of any micro-homology around the breakpoints.

3. Outputs

  • -O <fasta-file> : An output fasta file containing one entry for each alternative structural allele. The default is to write to stdout.

1. Introduction

The MergeDiscoveryOutput utility merges the results of a parallel discovery run into genome-wide output files.

In addition to merging the main VCF files, this utility will also merge the auxiliary output files (but leave them in the run directory).

2. Inputs / Arguments

  • -runDirectory <directory> : The run directory where the output files from each partition are stored.

3. Outputs

  • -O <vcf-file> : The location to write the merged output VCF file.

1. Introduction

The MergeGenotyperOutput utility merges the results of a parallel genotyping run into genome-wide output files.

In addition to merging the main VCF files, this utility will also merge the auxiliary output files (but leave them in the run directory).

2. Inputs / Arguments

  • -runDirectory <directory> : The run directory where the output files from each partition are stored.

3. Outputs

  • -O <vcf-file> : The location to write the merged output VCF file.

1. Introduction

The MergeInsertSizeDistributions utility combines multiple insert size histogram files into one.

The insert size histograms are stored in a binary file format.

2. Inputs / Arguments

  • -I <histogram-file> : The set of input histogram files.

  • -disjoint : If specified, assume the input files contain disjoint data, which is faster. If not specified, histograms with the same <Sample, Library, ReadGroup> key are merged.

3. Outputs

  • -O <histogram-file> : Output binary histogram file.

1. Introduction

The MergeReadDepthCoverage utility merges files containing read depth counts into one file.

This utility is used for parallelizing computation of read depth coverage.

See also ComputeReadDepthCoverage.

2. Inputs / Arguments

  • -I <coverage-file> : The set of input coverage files.

3. Outputs

  • -O <coverage-file> : Tab delimited merged output file.
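Conceptually, merging per-partition depth files means summing counts for matching intervals. The sketch below assumes a hypothetical four-column layout (CHROM, START, END, COUNT); the actual coverage file columns are not documented here.

```python
import csv
import io

def merge_depth_files(files):
    """Sum read-depth counts across per-partition tab-delimited files.

    Assumes columns CHROM, START, END, COUNT; rows describing the same
    interval are summed. (Hypothetical layout for illustration only.)
    """
    totals = {}
    for handle in files:
        for chrom, start, end, count in csv.reader(handle, delimiter="\t"):
            key = (chrom, int(start), int(end))
            totals[key] = totals.get(key, 0) + int(count)
    return totals

# Two toy partition files covering overlapping intervals
part1 = io.StringIO("chr1\t0\t1000\t42\n")
part2 = io.StringIO("chr1\t0\t1000\t8\nchr2\t0\t1000\t5\n")
merged = merge_depth_files([part1, part2])
```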

1. Introduction

The MergeReadSpanCoverage utility merges files containing read span counts into one file.

This utility is used for parallelizing computation of read span coverage.

See also ComputeReadSpanCoverage.

2. Inputs / Arguments

  • -I <span-coverage-file> : The set of input span coverage files.

3. Outputs

  • -O <span-coverage-file> : Tab delimited merged output file.

1. Introduction

The PlotGenotypingResults utility produces plots that help to visualize the genotyping data for individual sites.

The plots show a histogram of read depth, decorated with information about aberrant read pairs and/or breakpoint spanning reads and showing the called genotype for each sample. This utility requires information from auxiliary data files produced during genotyping which are written to the genotyping run directory.

2. Inputs / Arguments

  • -site <site-list> : The ID of a genotyped site (required). This should match the third column in the output VCF file. This argument can be supplied multiple times. It can also be a comma-separated list of sites or a file (with extension .list) containing a list of sites, one site per line.

  • -runDirectory <dir> : The run directory from a genotyping run which contains the auxiliary output files. This is the typical way to specify the location of the auxiliary output files.

  • -partitionMapFile <file> : Full path to the partition map file that maps site IDs to the partition used during genotyping (when genotyping is run in parallel). The default value is partition.genotypes.map.dat in the specified run directory. If you did not use parallel genotyping, you need to supply -auxFilePrefix instead.

  • -auxFilePrefix <string> : The path prefix to the auxiliary output files for the listed sites. This is the part of the path without the suffixes such as .genotypes.vcf or .genotypes.gts.dat. You can use either the auxiliary files for a single partition (if all sites to plot come from that partition) or the merged auxiliary files for the entire run, but the latter requires more I/O and memory and is recommended only for small runs. You should supply either the -runDirectory (and optionally the -partitionMapFile) argument or the -auxFilePrefix argument.

  • -genderMapFile <map-file> : Path to a map file specifying the gender of each sample (optional). If supplied, the gender of each sample is shown for sites on chrX and chrY by overlaying the sample with a colored dot (blue for boys, red for girls). The map file should contain two tab-delimited columns, SAMPLE and GENDER (M/F).

3. Outputs

  • -O <output-file> : The destination output file (PDF) (required).

4. Running

java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.PlotGenotypingResults \ 
    -site DEL_P0001_1 \ 
    -site my_site_file.list \ 
    -O my_output_file.pdf \ 
    -runDirectory genotyping_run_dir

1. Introduction

The PlotInsertSizeDistributions utility generates plots of the insert size distributions of each sequencing library.

For sequencing libraries with paired reads, one can measure the empirical distribution of the lengths of the DNA fragments in the library. Genome STRiP measures and models these insert size distributions and uses the modeled distributions for discovery and for genotyping. This utility plots the insert size distributions as modeled by Genome STRiP.

2. Inputs / Arguments

  • -I <histogram-file> : The input binary histogram file containing the insert size distribution data [required]. : Currently only isd.hist.bin files are supported (not the reduced representation isd.dist.bin files).

  • -library <library-ID> : The library or libraries to plot (this argument may be specified multiple times, or as a file with extension .list containing one library ID per line).

  • -sample <sample-ID> : The sample or samples to plot (this argument may be specified multiple times, or as a file with extension .list containing one sample ID per line).

3. Outputs

  • -O <output-file> : The destination output file (PDF) [required].

4. Running

java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.PlotInsertSizeDistributions \ 
    -I metadata/isd.hist.bin \ 
    -sample NA12878 \ 
    -O my_output_file.pdf 

1. Introduction

SVAltAlign.q is a sample Queue script that is part of Genome STRiP.

This script realigns previously unmapped reads against putative alternate alleles generated from a VCF file describing a set of variants to be genotyped. The output is a merged BAM file that contains these alignments to the alternate alleles. These alternate allele alignments are then used as input to genotyping.

2. Inputs / Arguments

  • -vcf <input-vcf-file> : A VCF file containing descriptions of the structural variations. : Only records for structural variations with precise breakpoints will be processed.

  • -I <bam-file> : The set of input BAM files containing records to realign.

  • -md <directory> : The metadata directory containing metadata about the input data set.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence. The fasta file must be indexed with samtools faidx or the equivalent.

  • -altAlleleFlankLength <n> : The length of flanking sequence from the reference genome used during realignment (default 200).

  • -alignUnmappedMates <boolean> : Whether to align unmapped mates of mapped reads to the alternate alleles (default true). : If false, then unmapped reads that have a POS field will be ignored and only reads in the unmapped portion of the BAM file will be realigned.

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. : A default configuration file is provided in conf/genstrip_parameters.txt.

3. Outputs

  • -O <bam-file> : The default output for this pipeline is a single merged bam file for all input bam files and all alternate alleles. : The sequence identifier for an alternate allele is VariantID_N where N is the index of the alternate allele in the VCF file (i.e. the first alternate allele is allele 1).
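The VariantID_N naming convention and the flanking-sequence construction described above can be sketched as follows. This is an illustrative approximation: the coordinate conventions (0-based half-open here) and the exact micro-homology handling of the real pipeline are assumptions.

```python
def alt_allele_entries(variant_id, ref_seq, start, end, alt_alleles, flank=200):
    """Build fasta-style entries for the alternate alleles of one variant.

    Sequence IDs follow the VariantID_N convention (alleles numbered from 1),
    and each allele is padded with `flank` reference bases on either side of
    the breakpoints, as -altAlleleFlankLength controls in the pipeline.
    Coordinates are 0-based half-open in this sketch.
    """
    left = ref_seq[max(0, start - flank):start]
    right = ref_seq[end:end + flank]
    return {f"{variant_id}_{i}": left + alt + right
            for i, alt in enumerate(alt_alleles, start=1)}

# Toy reference: a 6 bp segment deleted by the single alternate allele ("")
ref = "A" * 50 + "CGTCGT" + "T" * 50
entries = alt_allele_entries("DEL_P0001_1", ref, 50, 56, [""], flank=10)
```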

4. Running

The SVAltAlign.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \ 
    -S SVAltAlign.q \ 
    -S SVQScript.q \ 
    -gatk GenomeAnalysisTK.jar \ 
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \ 
    -tempDir /path/to/tmp/dir \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -vcf input.vcf \ 
    -I input1.bam -I input2.bam \ 
    -O output.bam \ 
    -run \ 
    -bsub \
    -jobQueue gsa \ 
    -jobProject 1KG \ 
    -jobLogDir logs 

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. : The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. : Note: Both -cp arguments are required in the example command. The first -cp argument is for the invocation of Queue itself, the second -cp argument is for the invocation of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.


1. Introduction

SVDiscovery.q is a sample Queue script that is part of Genome STRiP.

This script runs deletion discovery over an input data set based on a set of bam files. The input bam files must have been previously run through the SVPreprocess pipeline to generate auxiliary metadata.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory in which to store computed metadata about the input data set.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A file that contains the expected gender for each sample. : Tab delimited file with sample ID and gender on each line. Gender can be specified as M/F or 1 (male) and 2 (female).

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. : A default configuration file is provided in conf/genstrip_parameters.txt.

  • -runDirectory <directory> : Directory in which to place output files and intermediate run files.

  • -minimumSize <n> : The minimum size of an event that should be detected.

  • -maximumSize <n> : The maximum size of an event that should be detected. : This parameter also determines the size of the overlap between search windows when partitioning for parallel processing.

  • -windowSize <n> : For parallel processing, the size of each genomic locus to process in parallel. : The maximum size of an event that can be detected is also limited by the window size.

  • -windowPadding : This parameter specifies how far outside the search locus should be searched for informative read pairs. : This should be set based on the maximum insert sizes for read pairs in the input data set. : Ideally, we should estimate this parameter from the input data set.
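The gender map file consumed by -genderMapFile above is a simple two-column tab-delimited file, which is easy to generate. A minimal sketch (whether a header line is also accepted is not stated here, so only data lines are emitted):

```python
def gender_map_lines(genders):
    """Render gender map lines: 'SAMPLE<TAB>GENDER', one sample per line.

    Gender may be given as M/F or as 1 (male) / 2 (female), per the
    -genderMapFile description above.
    """
    return [f"{sample}\t{gender}" for sample, gender in sorted(genders.items())]

lines = gender_map_lines({"NA12878": "F", "NA12891": "M"})
# To produce the file: open("sample_genders.map", "w").write("\n".join(lines) + "\n")
```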

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing candidate SV sites. : The output VCF file also contains a variety of metrics in the INFO field that should be used for filtering to select a final set of SV calls.

The SVDiscovery pipeline also produces a number of other intermediate output files, useful mostly for debugging. The content of these files is not documented and is subject to change. If the genome is processed in parallel, there will be output from each parallel partition plus merged genome-wide output.

4. Running

The SVDiscovery.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \ 
    -S SVDiscovery.q \ 
    -S SVQScript.q \ 
    -gatk GenomeAnalysisTK.jar \ 
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \ 
    -configFile conf/genstrip_parameters.txt \ 
    -tempDir /path/to/tmp/dir \ 
    -runDirectory run1 \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -genderMapFile sample_genders.map \ 
    -I input1.bam -I input2.bam \ 
    -O output.sites.vcf \ 
    -minimumSize 100 \
    -maximumSize 1000000 \ 
    -windowSize 10000000 \ 
    -windowPadding 10000 \ 
    -run \
    -bsub \ 
    -jobQueue lsf_queue_name \ 
    -jobProject lsf_project \ 
    -jobLogDir logs

5. Parallel Processing

The discovery pipeline is designed to allow parallelism across many processors. Parallelism is achieved by partitioning the problem space and running on overlapping genomic windows, then merging the output from each partition.

One practical strategy, which was employed in the pilot phase of the 1000 Genomes Project and is shown in the example above, is to search for deletions between 100 bases and 1Mb in 10Mb windows (overlapping by 1Mb) using 10Kb padding, based on having paired-end libraries with relatively small insert sizes (100 bases - 2Kb).

It is also possible to partition the problem space by size of event. For example, you could search for small events (< 1Kb), medium sized events (1Kb - 100Kb) and large events (100Kb - 10Mb) in separate parallel runs, each with suitable window sizes, and then merge the outputs together. This approach is not currently implemented in an automated QScript, but if you are adventurous you could do this fairly easily using the same underlying building blocks as the standard pipeline.
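The overlapping-window partitioning described above can be sketched directly: windows advance by (windowSize - overlap) so adjacent windows share the overlap region, ensuring no event up to the maximum size is split across a boundary. A minimal sketch (the real pipeline's exact coordinate conventions are assumptions):

```python
def partition_windows(chrom_length, window_size, overlap):
    """Generate overlapping search windows across one chromosome.

    Windows advance by (window_size - overlap) so that adjacent windows
    share `overlap` bases, matching the 10 Mb windows / 1 Mb overlap
    strategy described above. Coordinates are 1-based inclusive.
    """
    step = window_size - overlap
    windows = []
    start = 1
    while start <= chrom_length:
        windows.append((start, min(start + window_size - 1, chrom_length)))
        start += step
    return windows

wins = partition_windows(25_000_000, 10_000_000, 1_000_000)
```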

6. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. : The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. : Note: Both -cp arguments are required in the example command. The first -cp argument is for the invocation of Queue itself, the second -cp argument is for the invocation of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

7. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.


Introduction

SVGenotyper.q is a sample Queue script that is part of Genome STRiP.

This script genotypes a set of input structural variation loci to determine the structural alleles carried by each sample. The script takes as input a VCF file of variant sites and a set of input bam files that have been previously run through the SVPreprocess pipeline to generate auxiliary metadata. To use split reads in genotyping, you also need to run the input VCF file through the SVAltAlign pipeline, which will realign all previously unmapped reads to the alternate alleles specified in the VCF file.

The input VCF file can be the output of Genome STRiP SV discovery, or it can be a VCF file of known or putative variants, or it can be the output from another SV discovery algorithm.

Currently, only genotyping of deletions (relative to the reference sequence) is supported. Although there is experimental code for genotyping other categories of structural variation, this code is not ready for external use.

Inputs / Arguments

  • -vcf <input-vcf-file> : A VCF file containing descriptions of the structural variations to genotype.

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory in which to store computed metadata about the input data set.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A file that contains the expected gender for each sample. Tab delimited file with sample ID and gender on each line. Gender can be specified as M/F or 1 (male) and 2 (female).

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.

  • -runDirectory <directory> : Directory in which to place output files and intermediate run files.

  • -altAlignments <bam-file> : A BAM file of alternate allele alignments produced by the SVAltAlign pipeline.

  • -parallelJobs <n> : Run using N parallel jobs by partitioning the input VCF file into N subsets.

  • -parallelRecords <n> : Run in parallel processing N VCF records in each parallel job.
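The -parallelJobs option divides the input VCF records into N near-equal subsets. How the real script balances the split is not documented, so the even-chunking scheme below is an assumption; it returns 1-based inclusive record ranges of the "records:N-M" form used elsewhere in Genome STRiP.

```python
def partition_records(n_records, n_jobs):
    """Split record indexes 1..n_records into n_jobs near-equal partitions.

    Returns 1-based inclusive (first, last) ranges, one per job. The
    balancing scheme (remainder spread over the first jobs) is an
    illustrative assumption.
    """
    base, extra = divmod(n_records, n_jobs)
    ranges, start = [], 1
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        if size == 0:
            break  # more jobs than records
        ranges.append((start, start + size - 1))
        start += size
    return ranges

parts = partition_records(10, 3)
```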

Outputs

  • -O <vcf-file> : The main output is a VCF file containing the input SV records plus genotypes for each sample in the input bam files. : The output VCF file will include genotype likelihoods for each sample at each variant site plus hard calls at a threshold of 95% confidence.

The SVGenotyper pipeline also produces a number of other intermediate output files, useful mostly for debugging. The content of these files is not documented and is subject to change. If the genome is processed in parallel, there will be output from each parallel partition plus merged genome-wide output.
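The hard-call behavior described in the Outputs section (calls emitted only at 95% confidence) can be illustrated with a small sketch. The flat prior and the exact no-call representation are assumptions; the real genotyper's model is more sophisticated.

```python
def hard_call(likelihoods, threshold=0.95):
    """Convert genotype likelihoods into a hard call at a posterior threshold.

    `likelihoods` maps genotype strings (e.g. '0/0', '0/1', '1/1') to
    likelihoods; a flat prior is assumed here for illustration. The best
    genotype is emitted only if its normalized posterior reaches the
    threshold; otherwise './.' (no-call) is returned, mirroring the 95%
    confidence hard calls described above.
    """
    total = sum(likelihoods.values())
    best, best_lik = max(likelihoods.items(), key=lambda kv: kv[1])
    return best if best_lik / total >= threshold else "./."

confident = hard_call({"0/0": 0.001, "0/1": 0.99, "1/1": 0.009})
ambiguous = hard_call({"0/0": 0.4, "0/1": 0.5, "1/1": 0.1})
```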

Running

The SVGenotyper.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \ 
    -S SVGenotyper.q \ 
    -S SVQScript.q \ 
    -gatk GenomeAnalysisTK.jar \ 
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \ 
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \ 
    -tempDir /path/to/tmp/dir \
    -runDirectory run1 \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -genderMapFile sample_genders.map \ 
    -I input1.bam -I input2.bam \ 
    -altAlignments altalign.bam \ 
    -vcf input.sites.vcf \ 
    -O output.genotypes.vcf \ 
    -parallelJobs 100 \ 
    -run \
    -bsub \ 
    -jobQueue lsf_queue_name \ 
    -jobProject lsf_project \ 
    -jobLogDir logs

Parallel Processing

The genotyping pipeline is designed to allow parallelism across many processors. Parallelism is achieved by partitioning the input VCF file, genotyping each subset of variants separately, and merging the results into the final output VCF file.

Overlapping Variants

Overlapping variants are currently genotyped independently. The posterior genotype likelihoods for one event do not affect the posterior genotype likelihoods for any other event, even if the events overlap and have incompatible alleles.

Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. : The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp arguments are required in the example command. The first -cp argument is for the invocation of Queue itself, the second -cp argument is for the invocation of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.


1. Introduction

SVPreprocess.q is a sample Queue script that is part of Genome STRiP.

This script preprocesses a set of input BAM files to generate genome-wide metadata that will be used in subsequent phases of Genome STRiP.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files. : These files form a "data set" that will be analyzed by Genome STRiP. The BAM files must have appropriate headers including read group (RG) and sample (SM) tags.

  • -md <directory> : The metadata directory in which to store computed metadata about the input data set.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with samtools faidx or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A file that contains the expected gender for each sample. Tab delimited file with sample ID and gender on each line. Gender can be specified as M/F or 1 (male) and 2 (female).

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.

3. Outputs

The SVPreprocess pipeline produces a number of output files in the specified metadata directory. The data files produced and the file formats used are subject to change in future releases.

Currently, output files are produced in the following categories:

1. Insert size distributions (isd)

Binary files are generated that contain information on the distribution of insert lengths for each library or read group in the input BAM files. Normally, all of these are merged into one file called isd.hist.bin.

A text file, isd.stats.dat, is also produced that contains informative statistics about each library or read group, including the median insert length and the robust standard deviation (RSD). This file can be reviewed to identify libraries with unusual insert size distributions that should be withheld from analysis.
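The median and robust standard deviation reported in isd.stats.dat can be approximated as below. The exact robust estimator used by Genome STRiP is not specified in this document; the scaled median absolute deviation (MAD x 1.4826, consistent with a normal distribution) shown here is one common choice and is an assumption.

```python
import statistics

def insert_size_stats(insert_sizes):
    """Median and a robust standard deviation for one library's insert sizes.

    The robust SD here is the scaled median absolute deviation
    (MAD * 1.4826); outliers such as chimeric pairs barely affect it,
    unlike the ordinary standard deviation.
    """
    med = statistics.median(insert_sizes)
    mad = statistics.median(abs(x - med) for x in insert_sizes)
    return med, 1.4826 * mad

# The single 1000 bp outlier leaves the robust statistics nearly untouched
median, rsd = insert_size_stats([290, 295, 300, 305, 310, 1000])
```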

2. Read depth (depth)

The main output is a text file, depth.dat, containing the genome-wide count of aligned fragments. The read counts are based on the filtering parameters used (principally mapping quality).

3. Read span coverage (spans)

The main output is a text file, spans.dat, containing the genome-wide coverage (in base pairs) that exists between the two alignments that comprise a read pair. The span coverage is an approximation of the power to detect a breakpoint by spanning read pairs.
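The span coverage idea above (counting, for each base, the pairs whose two alignments bracket it) can be sketched with a toy per-position counter. Genome-wide the real computation is necessarily interval-based rather than per-base; this is an illustration of the definition only.

```python
from collections import Counter

def span_coverage(read_pairs):
    """Per-base count of read pairs spanning each position.

    Each pair is (left_end, right_start): the span is the gap between the
    two alignments, and every base strictly inside it gains one unit of
    span coverage. A toy stand-in for the genome-wide spans.dat values.
    """
    coverage = Counter()
    for left_end, right_start in read_pairs:
        for pos in range(left_end + 1, right_start):
            coverage[pos] += 1
    return coverage

cov = span_coverage([(100, 105), (102, 106)])
```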

4. Running

The SVPreprocess.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \ 
    -S SVPreprocess.q \ 
    -S SVQScript.q \ 
    -gatk GenomeAnalysisTK.jar \ 
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \ 
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \ 
    -tempDir /path/to/tmp/dir \
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -genderMapFile sample_genders.map \ 
    -I input1.bam -I input2.bam \ 
    -run \ 
    -bsub \ 
    -jobQueue lsf_queue_name \
    -jobProject lsf_project \ 
    -jobLogDir logs 

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. : Note: Both -cp arguments are required in the example command. The first -cp argument is for the invocation of Queue itself, the second -cp argument is for the invocation of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.


1. Introduction

The SVAltAligner walker traverses a set of BAM files to compute alignments to the alternate alleles of structural variations. This walker is one component of the SVAltAlign pipeline.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files containing records to realign.

  • -altReference <fasta-file> : The fasta file for the alternate allele reference sequences. The fasta file must be indexed with 'samtools faidx' or the equivalent. This file should be the output from GenerateAltAlleleFasta.

  • -alignMappedReads : If present, then align all reads in the input BAM files, not just unmapped reads (default false).

  • -alignUnmappedMates <true/false> : If true (the default), then align unmapped mates of mapped reads (i.e. reads that have a reference position but have the unmapped flag set). If set to false, then only reads in the "unmapped" portion of the BAM file will be aligned.

  • -md <directory> : The metadata directory containing metadata about the input data set.

3. Outputs

  • -O <bam-file> : The output from this walker is a BAM file containing new alignments for input reads that align to the alternate allele reference sequences. If no output file is specified, the output is in SAM format instead and is written to standard output.

1. Introduction

The SVDiscovery walker traverses a set of BAM files to perform structural variation discovery. This walker is the main component of the SVDiscovery pipeline.

Currently, only discovery of deletions relative to the reference is implemented.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -runDirectory <directory> : The directory where auxiliary output files will be written (default is the current directory).

  • -md <directory> : The metadata directory containing metadata about the input data set. See SVPreprocess.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. : A default configuration file is provided in conf/genstrip_parameters.txt.

  • -partitionName <string> : This specifies the name of the partition being computed during parallel runs. : The output files will be prefixed with the name of the partition.

  • -searchLocus <interval> : The genomic locus being searched. : Only structural variations that fit within the specified locus will be output. If non-overlapping search loci are used, then the union of the discovered variants should be non-redundant.

  • -searchWindow <interval> : The interval to be used for searching the input BAM files. : This is typically larger than the search locus to avoid missing events due to boundary effects. : This argument should typically be set to the same value as the GATK -L argument.

  • -searchMinimumSize <size> : The minimum length of a deletion event for it to be included in the output.

  • -searchMaximumSize <size> : The maximum length of a deletion event for it to be included in the output.
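The relationship between -searchLocus and -searchWindow above can be sketched as a simple containment filter: the window is read, but only variants that fit within the (smaller) locus are reported, so non-overlapping loci yield a non-redundant union across partitions. Coordinates here are simplified to (start, end) pairs on one chromosome.

```python
def assign_to_locus(variants, locus):
    """Keep only variants that fit entirely within the search locus.

    `locus` and each variant are (start, end) pairs. Because a variant is
    reported only by the partition whose locus contains it, partitions with
    non-overlapping loci produce a non-redundant union of calls, even
    though their search windows overlap.
    """
    lo, hi = locus
    return [v for v in variants if lo <= v[0] and v[1] <= hi]

# The second variant extends past the locus and is left to another partition
calls = assign_to_locus([(150, 900), (950, 1200)], (1, 1000))
```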

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing descriptions of the variant sites along with annotations about the evidence for the variability of the site. : The output VCF file will need to be filtered, based on the annotations, to select a final set of high specificity variants.

Depending on settings in the configuration file, this walker will also produce a number of auxilliary output files. These files are mostly useful for debugging. The content and format of these files is subject to change.

4. Running

Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVDiscovery \ 
    -T SVDiscovery \ 
    -configFile conf/genstrip_parameters.txt \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -I input1.bam -I input2.bam \ 
    -O output.sites.vcf \ 
    -runDirectory run1 \ 
    -minimumSize 100 \
    -maximumSize 1000000 \ 
    -searchLocus chr20:1-1000000 \ 
    -L chr20:1-1000000 \
    -searchWindow chr20:1-1000000 

5. Dependencies

The SV Discovery code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.


1. Introduction

The SVGenotyper walker traverses a VCF file to compute genotypes for structural variations. This walker is the main component of the SVGenotyper pipeline.

Currently, only genotyping of deletions relative to the reference is implemented.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -runDirectory <directory> : The directory where auxiliary output files will be written (default is the current directory).

  • -md <directory> : The metadata directory containing metadata about the input data set. See SVPreprocess.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. See Genome Mask Files.

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.

  • -sample <sample-ID> : The sample to genotype (or list of samples if multiple arguments are supplied). By default, genotypes are computed for all samples present in the input BAM files.

  • -sampleList <file> : A file containing the list of samples to genotype (one sample ID per line).

  • -altAlleleAlignments <bam-file> : A BAM file containing alignments to the alternate alleles of events present in the input VCF file. These alternate alignments should be computed by the SVAltAlign pipeline.

  • -partitionName <string> : This specifies the name of the partition being computed during parallel runs. The output files will be prefixed with the name of the partition.

  • -partition <partition-spec> : Describes the subset of the VCF file to process. : The format is "records:N-M" where ''N'' and ''M'' are the 1-based indexes of a range of records from the input VCF file that will be processed.
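The "records:N-M" partition spec above has a straightforward shape, which a short parser can validate. A minimal sketch (only the documented form is accepted; any other partition-spec variants are not covered here):

```python
import re

def parse_partition(spec):
    """Parse a 'records:N-M' partition spec into 1-based (N, M) indexes."""
    match = re.fullmatch(r"records:(\d+)-(\d+)", spec)
    if not match:
        raise ValueError(f"bad partition spec: {spec}")
    first, last = int(match.group(1)), int(match.group(2))
    if first > last:
        raise ValueError(f"empty record range: {spec}")
    return first, last

first, last = parse_partition("records:101-200")
```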

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing genotypes for structural variation sites from the input VCF file.

Depending on settings in the configuration file, this walker will also produce a number of auxiliary output files. These files are mostly useful for debugging. The content and format of these files is subject to change.

4. Running

Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.

The input VCF file should be passed as a GATK ROD (reference ordered datum) file. This walker also requires the -BTI argument to be passed to the GATK engine.

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVGenotyper \ 
    -T SVGenotyper \ 
    -configFile conf/genstrip_parameters.txt \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -altAlignments alt_allele_alignments.bam \ 
    -B:input,VCF input.sites.vcf \ 
    -BTI input \ 
    -I input1.bam -I input2.bam \ 
    -O output.genotypes.vcf \ 
    -runDirectory run1

5. Dependencies

The SV Genotyping code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.

Comments (0)

1. Introduction

The SVVariantAnnotator walker is a general framework for making different annotations on VCF files containing structural variant records.

SVVariantAnnotator is conceptually similar to the GATK VariantAnnotator walker, but is tailored for use with structural variations. Each annotator is like a plug-in, and you can run one or more annotators over the same file in one invocation.

Each annotator can either add annotations (typically INFO fields) to the input VCF file, or it can generate a textual report (e.g. tab delimited) for additional processing, or it can generate one or more summary reports (also text files) or all of the above. Different annotators may or may not support all modes of operation.

2. Annotators

The following annotators are currently implemented:

SuperArray

VariantsPerSample

3. Common Arguments

The following common arguments are handled by the SVVariantAnnotator walker itself:

  • -A <annotation> : The name of the annotation(s) you want to run.

  • -B:variant,VCF <input-vcf-file> : The input VCF file is supplied to SVVariantAnnotator as a ROD binding using the generic GATK -B argument. : The input VCF file must be sorted in coordinate order based on the reference sequence.

  • -BTI variant : You should always pass this generic GATK flag when running SVVariantAnnotator to drive iteration over the input VCF file. Note that variant is literal.

  • -R <reference-sequence> : Indexed fasta file containing the reference sequence.

  • -writeReport : Flag indicating that the annotator should print a report file (default false). : Typically, this will contain one line per VCF record and will contain the same information as the annotations that would be stored in the VCF file, but may be easier to parse.

  • -writeSummary : Flag indicating that the annotator should write a summary file (default false). : Depending on the annotator, the summary file produces information organized in different ways. : For example, the VariantsPerSample annotator produces a summary file containing a row for each sample and the count of variants in that sample.

  • -reportDirectory <directory> : The directory to write report files and summary files (default is the current directory).

  • -tempDir <directory> : The directory used to store temporary files.
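
As a hypothetical illustration of the kind of per-sample summary the VariantsPerSample annotator is described as producing (one row per sample with its variant count), the following sketch counts non-reference genotype calls per sample in VCF text; it is not the annotator's actual implementation:

```python
def variants_per_sample(vcf_lines):
    """Count non-reference genotype calls per sample from VCF lines.
    Assumes GT is the first FORMAT field."""
    samples, counts = [], {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # skip meta-information headers
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            counts = {s: 0 for s in samples}
            continue
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]
            # Count the sample as variant if any called allele is non-reference.
            alleles = gt.replace("|", "/").split("/")
            if any(a not in ("0", ".") for a in alleles):
                counts[sample] += 1
    return counts
```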

4. Outputs

  • -O <output-file> : The destination for the output VCF file. : If you do not specify -O, then the annotator will not produce VCF annotations (but may produce reports or summaries).

  • -reportFile <file-path> : The path and file name for the report file. This overrides -reportDirectory. : If not supplied, the report file is based on the name of the annotator (and will be in the report directory).

  • -summaryFile <file-path> : The path and file name for the summary file. This overrides -reportDirectory. : If not supplied, each annotator generates a default summary file name (if the annotator supports writing summary files).

5. Availability

Available in SVToolkit version 1.04.

Comments (0)

1. Introduction

Genome STRiP makes use of mask files that identify portions of the reference sequence that are not reliably alignable.

Genome mask files are fasta files with the same number of sequences and of the same length as the reference sequence. In a genome mask file, a base position is marked with a 0 if it is reliably alignable and 1 if it is not. Each genome mask file is specific to the reference sequence and to the parameters used to determine alignability.

The current generation of mask files is based on fixed read lengths. A base is assigned a 0 if the N-base sequence centered at that position is unique within the reference genome. You should use a genome mask with a value of N that corresponds to the read lengths of your input data set. For example, if your data is a uniform set of Illumina paired-end data with 101bp reads, then you should use (or generate) a genome mask with a read length of 101. If your data is a mixture of read lengths, one viable strategy is a "lowest common denominator" approach: use a mask length corresponding to the shortest reads in your input data set. Using the smallest read length will cause a small additional fraction of the genome to be marked inaccessible, but will give the best specificity. Alternatively, you can use a larger N, which should modestly improve sensitivity at the cost of a modest increase in false discovery rate and a modest decrease in genotyping accuracy.
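
The alignability rule above can be sketched in a few lines of Python: mark a position 0 only if the N-mer centered on it occurs exactly once in the genome. This toy version considers the forward strand only and treats positions too close to the sequence ends as inaccessible; a real mask computation would also need to consider reverse-complement alignments:

```python
def compute_mask(genome, n):
    """Toy genome mask: position i gets '0' if the n-mer centered at i
    occurs exactly once in the genome, '1' otherwise (forward strand only)."""
    counts = {}
    for i in range(len(genome) - n + 1):
        kmer = genome[i:i + n]
        counts[kmer] = counts.get(kmer, 0) + 1
    mask = []
    half = n // 2
    for i in range(len(genome)):
        start = i - half
        if start < 0 or start + n > len(genome):
            mask.append("1")  # too close to the edge to extract a full n-mer
        else:
            mask.append("0" if counts[genome[start:start + n]] == 1 else "1")
    return "".join(mask)
```

In a real mask fasta, this 0/1 string would be stored under the same sequence name and at the same length as the corresponding reference sequence.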

2. Resources

Some precomputed mask files for a variety of reference sequences and read lengths are available at ftp://ftp.broadinstitute.org/pub/svtoolkit/svmasks.

3. Generating your own genome mask

The ComputeGenomeMask command line utility is available to generate genome mask files, but queue scripts to automate the process have not been written. A reasonable strategy is to compute the genome mask in parallel chromosome-by-chromosome and then merge the resulting fasta files into a final genome-wide mask file.
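
Assuming each per-chromosome run writes its own mask fasta, the merge step can be as simple as concatenating those files in reference order (the merge_fasta helper and file names here are hypothetical):

```python
def merge_fasta(chrom_files, out_path):
    """Concatenate per-chromosome fasta mask files into one genome-wide file.
    chrom_files must be listed in the same order as the reference sequence."""
    with open(out_path, "w") as out:
        for path in chrom_files:
            with open(path) as f:
                for line in f:
                    out.write(line)
```

The merged file must list its sequences in exactly the same order as the reference fasta, since the mask is matched to the reference positionally.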

4. Planned Enhancements

The implementation of mask files will be replaced in a future release.

Mask files are being converted from textual fasta files to binary files and are being enhanced to better support input data sets with multiple read lengths (so the use of a "lowest common denominator" strategy will no longer be necessary).

Comments (2)

1. Introduction

Genome STRiP is built out of GATK walkers and some miscellaneous command line utilities.

If you need to implement a specialized pipeline, you can use these modules directly, using the standard Queue pipelines as a guide. The standard Queue pipelines also use Samtools, BWA and some Picard utilities.

2. GATK Walkers

ComputeInsertSizeDistributions

ComputeReadDepthCoverage

ComputeReadSpanCoverage

SVAltAlignerWalker

SVDiscoveryWalker

SVGenotyperWalker

3. Utilities

ComputeGenomeMask

ComputeGenomeMaskStatistics

ComputeInsertStatistics

GenerateAltAlleleFasta

MergeDiscoveryOutput

MergeGenotyperOutput

MergeInsertSizeDistributions

MergeReadDepthCoverage

MergeReadSpanCoverage

4. Pre-Release

This section documents various utilities and walkers that are not yet available in the production release. These utilities may be available in some of the interim releases (build snapshots) downloadable from our website http://www.broadinstitute.org/software/genomestrip.

PlotGenotypingResults

PlotInsertSizeDistributions

Comments (7)

1. Introduction

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.

Genome STRiP uses the Genome Analysis Toolkit (GATK). There are pre-defined Queue pipelines to simplify running analyses.

The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.

Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.

To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set. Please report bugs through svtoolkit-help@lists.sourceforge.net.

Before posting, please review the FAQ.

2. Structure

Genome STRiP consists of a number of related modules.

To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyping. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.

3. Inputs and Outputs

Genome STRiP requires aligned sequence data in BAM format.

The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.

Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.

4. Downloading and Installation

Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.

To install, download the tarball and decompress it into a suitable directory. You will need to install prerequisite software as described below. There is a 10-minute installation/verification test in the installtest subdirectory. You will also need to download (or build) a suitable genome mask file.

The test scripts also serve as example pipelines for running Genome STRiP.

Environment Variables

Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.

5. Dependencies

Java

Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar). You will need Java 1.7.

GATK

Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.

The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.

Picard

The Genome STRiP pipelines use some Picard standalone command line utilities. You will need to install these separately. URL: http://picard.sourceforge.net

Samtools

The pipelines use 'samtools index' to index BAM files. You will need to install samtools separately. URL: http://samtools.sourceforge.net

This dependency on samtools could in theory be replaced with Picard 'BuildBAMIndex', if you can't run samtools for some reason.

BWA

Several pipeline functions use BWA (the executable) and also use BWA through its C API. You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net

A pre-built Linux shared library, libbwa.so, required by Genome STRiP, comes with the SVToolkit distribution. This library is built from the BWA source code and from source code that is part of the GATK.

The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, you can try running with the pre-built version of bwa included in the distribution that was built from the same version as the shared library.

R

Genome STRiP uses some R scripts internally.

To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.

Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.

6. Running Genome STRiP

Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.

A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible scala-based system for writing processing pipelines that can be distributed on compute farms. These pipeline scripts should be taken as example templates and they may need to be modified for your specific analysis.

Each processing step has a corresponding Queue pipeline script:

SVPreprocess

Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.

SVAltAlign

Re-alignment of reads from input BAM files to alternative alleles described in an input VCF file.

SVDiscovery

Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.

SVGenotyper

Genotype a set of polymorphic structural variation loci described in a VCF file.

Components

The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.

7. Support

We have set up a mailing list for bug reports and questions at svtoolkit-help@lists.sourceforge.net.

You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.

The FAQ is here.

Note that we are currently not distributing software through sourceforge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.

MuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes. Please see the MuTect website for more information.

You are welcome to ask questions and report problems about MuTect in this category of the forum.
The XHMM (eXome-Hidden Markov Model) C++ software suite calls copy number variation (CNV) from next-generation sequencing projects in which exome capture (or, more generally, targeted sequencing) was used. Specifically, XHMM uses principal component analysis (PCA) normalization and a hidden Markov model (HMM) to detect and genotype CNVs from normalized read-depth data from targeted sequencing experiments. Please see the XHMM website for more information.

You are welcome to ask questions and report problems about XHMM in this category of the forum.

See also the XHMM Google Group.