Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.
Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.
The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.
Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.
To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set. Please report bugs through firstname.lastname@example.org.
Before posting, please review the FAQ.
Genome STRiP consists of a number of modules, related as shown below.
To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyping. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.
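The module ordering described above can be written down as a small sketch. The module names come from the text; the function name `modules_to_run` and its `known_variants` argument are hypothetical and are not part of Genome STRiP.

```python
# Module names are from the documentation; everything else here is illustrative.
ALL_MODULES = ["SVPreprocess", "SVDiscovery", "SVAltAlign", "SVGenotyping"]


def modules_to_run(known_variants):
    """Return the modules to run, in order.

    known_variants=True models genotyping an existing set of variants in
    new samples, in which case the SVDiscovery step is skipped.
    """
    if known_variants:
        return [m for m in ALL_MODULES if m != "SVDiscovery"]
    return list(ALL_MODULES)


print(modules_to_run(known_variants=False))
print(modules_to_run(known_variants=True))
```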
3. Inputs and Outputs
Genome STRiP requires aligned sequence data in BAM format.
The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.
Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.
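Running samtools faidx on a FASTA file writes a tab-separated .fai index next to it, with one line per sequence (name, length, byte offset, bases per line, bytes per line). As a minimal sketch, the helper below reads sequence names and lengths from the text of such an index, which can be handy for checking that the reference is indexed and matches what you expect. The function name and the example index content are made up for illustration.

```python
def parse_fai(text):
    """Map sequence name -> sequence length from the text of a .fai index."""
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


# Hypothetical index for a two-sequence reference wrapped at 60 bases per line.
example_fai = "chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n"
print(parse_fai(example_fai))
```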
4. Downloading and Installation
Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.
To install, download the tarball and decompress into a suitable directory. You will need to install pre-requisite software as described below. There is a 10-minute installation/verification test in the installtest subdirectory. You will also need to download (or build) a suitable Genome Mask File.
The test scripts also serve as example pipelines for running Genome STRiP.
Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.
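A wrapper script can fail early when SV_DIR is missing. The sketch below only checks that the variable is set and names a directory. It makes no assumption about the layout inside the installation directory, and the function name `check_sv_dir` is hypothetical.

```python
import os


def check_sv_dir(environ=os.environ):
    """Return the SV_DIR path, or raise if it is unset or not a directory."""
    sv_dir = environ.get("SV_DIR")
    if not sv_dir:
        raise RuntimeError(
            "SV_DIR is not set; point it at the Genome STRiP installation directory"
        )
    if not os.path.isdir(sv_dir):
        raise RuntimeError("SV_DIR does not name a directory: " + sv_dir)
    return sv_dir


if __name__ == "__main__":
    try:
        print("Using SV_DIR =", check_sv_dir())
    except RuntimeError as err:
        print("Environment problem:", err)
```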
Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar). You will need Java 1.6.
Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.
The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.
The Genome STRiP pipelines use some Picard standalone command line utilities. You will need to install these separately. URL: http://picard.sourceforge.net
The pipelines use 'samtools index' to index BAM files. You will need to install samtools separately. URL: http://samtools.sourceforge.net
This dependency on samtools could in theory be replaced with Picard 'BuildBAMIndex', if you can't run samtools for some reason.
Several pipeline functions use BWA (the executable) and also use BWA through its C API. You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net
A pre-built Linux shared library, libbwa.so, that is required by GenomeSTRiP comes with the SVToolkit distribution. This library is built from the BWA source code and source code that is part of GATK.
The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, you can try running with the pre-built version of bwa included in the distribution that was built from the same version as the shared library.
Genome STRiP uses some R scripts internally.
To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.
Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.
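Before running the installation test, it can help to confirm that the external executables mentioned in this section are on your path. The following is a minimal sketch using Python's standard `shutil.which`. The tool list is an assumption based on the text (Java, samtools, BWA and Rscript), and the function name is hypothetical. Picard is not checked because it is distributed as jar files rather than as a single executable.

```python
import shutil

# Executable names assumed from the prerequisites described above.
REQUIRED_TOOLS = ["java", "samtools", "bwa", "Rscript"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```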
6. Running Genome STRiP
Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.
A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible Scala-based system for writing processing pipelines that can be distributed on compute farms. These pipeline scripts should be taken as example templates and they may need to be modified for your specific analysis.
Each processing step has a corresponding Queue pipeline script:
SVPreprocess: Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.
SVAltAlign: Re-align reads from input BAM files to alternative alleles described in an input VCF file.
SVDiscovery: Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.
SVGenotyping: Genotype a set of polymorphic structural variation loci described in a VCF file.
The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as Java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.
We have set up a mailing list for bug reports and questions at email@example.com.
You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.
The FAQ is here.
Note that we are currently not distributing software through sourceforge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.
The building blocks for Genome STRiP are built out of GATK Walkers and some miscellaneous command line utilities.
If you need to implement a specialized pipeline, you can use these modules directly, using the standard Queue pipelines as a guide. The standard Queue pipelines also use Samtools, BWA and some Picard utilities.
2. GATK Walkers
This section documents various utilities and walkers that are not yet available in the production release. These utilities may be available in some of the interim releases (build snapshots) downloadable from our website http://www.broadinstitute.org/software/genomestrip.
Genome STRiP makes use of mask files that identify portions of the reference sequence that are not reliably alignable.
Genome mask files are FASTA files that contain the same number of sequences as the reference, each with the same length as the corresponding reference sequence. In a genome mask file, a base position is marked with a 0 if it is reliably alignable and 1 if it is not. Each genome mask file is specific to the reference sequence and to the parameters used to determine alignability.
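Because a mask is an ordinary FASTA file of 0 and 1 characters, it is easy to summarize. The sketch below computes, per sequence, the fraction of positions marked 0 (reliably alignable). The function name and the toy mask are made up for illustration. A real genome-wide mask is large, so in practice you would stream the file rather than hold it in a string.

```python
def alignable_fraction(mask_fasta_text):
    """Per-sequence fraction of positions marked 0 (reliably alignable)."""
    counts = {}  # name -> [number of zeros, total positions]
    name = None
    for line in mask_fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += line.count("0")
            counts[name][1] += len(line)
    return {n: (z / t if t else 0.0) for n, (z, t) in counts.items()}


# Toy mask: chr1 has 6 of 8 positions alignable, chr2 has none.
toy_mask_text = ">chr1\n0011\n0000\n>chr2\n1111\n"
print(alignable_fraction(toy_mask_text))
```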
The current generation of mask files is based on fixed read lengths. A base is assigned a 0 if the N-base sequence centered on that base is unique within the reference genome. You should use a genome mask with a value of N that corresponds to the read lengths of your input data set. For example, if your data is a uniform set of Illumina paired-end data with 101bp reads, then you should use (or generate) a genome mask with a read length of 101. If your data is a mixture of read lengths, one viable strategy is to use a "lowest common denominator" approach and use a mask length corresponding to the shortest reads in your input data set. Using the smallest read length will cause a small additional fraction of the genome to be marked inaccessible, but will give the best specificity. Alternatively, you can use a larger N, which should modestly improve sensitivity at the cost of a modest increase in false discovery rate and a modest decrease in genotyping accuracy.
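The uniqueness rule can be illustrated on a toy reference. This is only a sketch of the idea and not how the distributed masks are computed. It assumes exact matching on the forward strand only, an odd window length, and that positions too close to a sequence end to have a full window are marked 1. The function name `toy_mask` is hypothetical.

```python
from collections import Counter


def toy_mask(ref, n):
    """Return a 0/1 mask string for ref using windows of odd length n.

    A position gets 0 if the n-base window centered on it occurs exactly
    once in ref (forward strand, exact match), and 1 otherwise.
    """
    if n % 2 == 0:
        raise ValueError("this sketch only handles odd window lengths")
    half = n // 2
    # Window centered on each position that has a full window available.
    windows = {i: ref[i - half:i + half + 1] for i in range(half, len(ref) - half)}
    counts = Counter(windows.values())
    return "".join(
        "0" if i in windows and counts[windows[i]] == 1 else "1"
        for i in range(len(ref))
    )


print(toy_mask("ACGTACGTTT", 3))  # prints 1110011001
```

In this example the windows ACG and CGT each occur twice, so the positions they are centered on are marked 1, as are the two end positions.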
Some precomputed mask files for a variety of reference sequences and read lengths are available at ftp://ftp.broadinstitute.org/pub/svtoolkit/svmasks.
3. Generating your own genome mask
The ComputeGenomeMask command line utility is available to generate genome mask files, but Queue scripts to automate the process have not been written. A reasonable strategy is to compute the genome mask in parallel chromosome-by-chromosome and then merge the resulting FASTA files into a final genome-wide mask file.
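The merge step is plain FASTA concatenation. The sketch below joins per-chromosome mask text in a caller-supplied order, on the assumption that the merged mask should list sequences in the same order as the reference (for example, the order in the reference .fai index). The function name and the toy inputs are hypothetical, and the sketch does not show how ComputeGenomeMask itself is invoked.

```python
def merge_masks(per_chrom, order):
    """Concatenate per-chromosome mask FASTA text in the given order.

    per_chrom maps sequence name -> FASTA text for that sequence's mask.
    order lists the sequence names in reference order.
    """
    missing = [name for name in order if name not in per_chrom]
    if missing:
        raise KeyError("no mask computed for: " + ", ".join(missing))
    parts = []
    for name in order:
        text = per_chrom[name]
        # Make sure each record ends with a newline before the next header.
        parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts)


masks = {"chr2": ">chr2\n1100", "chr1": ">chr1\n0000\n"}
print(merge_masks(masks, ["chr1", "chr2"]), end="")
```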
4. Planned Enhancements
The implementation of mask files will be replaced in a future release.
Mask files are being converted from textual FASTA files to binary files and are being enhanced to better support input data sets with multiple read lengths (so the use of a "lowest common denominator" strategy will no longer be necessary).