The SVDiscovery walker traverses a set of BAM files to perform structural variation discovery. This walker is the main component of the SVDiscovery pipeline.
Currently, only discovery of deletions relative to the reference is implemented.
-I <bam-file> : The set of input BAM files.
-runDirectory <directory> : The directory where auxiliary output files
will be written (default is the current directory).
-md <directory> : The metadata directory containing metadata about the
input data set. See SVPreprocess.
-R <fasta-file> : Reference sequence. An indexed fasta file containing
the reference sequence that the input BAM files were aligned against. The
fasta file must be indexed with 'samtools faidx' or the equivalent.
-genomeMaskFile <mask-file> : Mask file that describes the alignability of
the reference sequence. See Genome Mask Files.
-configFile <configuration-file> : This file contains specialized settings
that do not normally need to be changed. A default configuration file is
provided in conf/genstrip_parameters.txt.
-partitionName <string> : The name of the partition being computed during
parallel runs. The output files will be prefixed with the name of the
partition.
-searchLocus <interval> : The genomic locus being searched. Only
structural variations that fit within the specified locus will be output. If
non-overlapping search loci are used, then the union of the discovered
variants should be non-redundant.
-searchWindow <interval> : The interval to be used for searching the input
BAM files. This is typically larger than the search locus to avoid missing
events due to boundary effects. This argument should typically be set to the
same value as the GATK -L argument.
-searchMinimumSize <size> : The minimum length of a deletion event for it
to be included in the output.
-searchMaximumSize <size> : The maximum length of a deletion event for it
to be included in the output.
-O <vcf-file> : The main output is a VCF file containing descriptions of the
variant sites along with annotations about the evidence for the variability
of each site. The output VCF file will need to be filtered, based on the
annotations, to select a final set of high-specificity variants.
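The way -partitionName, -searchLocus and -searchWindow relate to each other in a parallel run can be sketched as follows. This is only an illustration, not part of the toolkit: the function name make_partitions, the P0001-style partition names, the chunk size and the padding are all assumptions. The sketch splits one contig into non-overlapping search loci and pads each locus to obtain a larger search window.

```shell
#!/bin/sh
# Hypothetical helper (not part of SVToolkit): split one contig into
# non-overlapping search loci and derive a padded search window for each.
# Prints one line per partition: <partitionName> <searchLocus> <searchWindow>
# Usage: make_partitions <contig> <contig_length> <chunk_size> <padding>
make_partitions() {
    contig=$1; length=$2; chunk=$3; pad=$4
    start=1
    index=1
    while [ "$start" -le "$length" ]; do
        # End of this locus, clamped to the end of the contig.
        end=$((start + chunk - 1))
        if [ "$end" -gt "$length" ]; then end=$length; fi
        # Pad the window on both sides, clamped to the contig boundaries.
        wstart=$((start - pad))
        if [ "$wstart" -lt 1 ]; then wstart=1; fi
        wend=$((end + pad))
        if [ "$wend" -gt "$length" ]; then wend=$length; fi
        printf 'P%04d %s:%d-%d %s:%d-%d\n' "$index" \
            "$contig" "$start" "$end" "$contig" "$wstart" "$wend"
        # The next locus starts right after this one, so loci never overlap.
        start=$((end + 1))
        index=$((index + 1))
    done
}

# Example: a 2.5 Mb contig in 1 Mb loci with 10 kb of padding (assumed values).
make_partitions chr20 2500000 1000000 10000
```

Because the loci do not overlap, the per-partition outputs can be combined without producing duplicate sites, while the padded windows let reads just outside each locus contribute evidence.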
Depending on settings in the configuration file, this walker will also produce a number of auxiliary output files. These files are mostly useful for debugging. The content and format of these files are subject to change.
Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.
java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVDiscovery \
    -T SVDiscovery \
    -configFile conf/genstrip_parameters.txt \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -I input1.bam -I input2.bam \
    -O output.sites.vcf \
    -runDirectory run1 \
    -minimumSize 100 \
    -maximumSize 1000000 \
    -searchLocus chr20:1-1000000 \
    -L chr20:1-1000000 \
    -searchWindow chr20:1-1000000
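For parallel runs, the same invocation can be parameterized by partition. The following is a minimal sketch that only prints the command rather than running it. The function name discovery_command and the per-partition output file name are assumptions; the file names are the placeholders from the example above, and the size limits are left out for brevity. Following the guidance for -searchWindow, the GATK -L argument is set to the same interval as the search window.

```shell
#!/bin/sh
# Hypothetical helper (not part of SVToolkit): print, but do not run, the
# SVDiscovery command line for one partition.
# Usage: discovery_command <partitionName> <searchLocus> <searchWindow>
discovery_command() {
    name=$1; locus=$2; window=$3
    # printf reuses the '%s ' format for every argument, so the command is
    # emitted on a single line with the arguments separated by spaces.
    printf '%s ' \
        java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
        org.broadinstitute.sv.main.SVDiscovery \
        -T SVDiscovery \
        -configFile conf/genstrip_parameters.txt \
        -md metadata \
        -R Homo_sapiens_assembly18.fasta \
        -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
        -I input1.bam -I input2.bam \
        -O "$name.sites.vcf" \
        -runDirectory run1 \
        -partitionName "$name" \
        -searchLocus "$locus" \
        -searchWindow "$window" \
        -L "$window"
    printf '\n'
}

# Example with an assumed partition name and a window padded beyond the locus.
discovery_command P0001 chr20:1-1000000 chr20:1-1010000
```

Printing the command first makes it easy to inspect each partition's arguments before submitting the jobs to a scheduler.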
The SV Discovery code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.
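A quick pre-flight check can catch a missing Rscript or an unindexed reference before a long run starts. The sketch below is an assumption about how one might do this, not something shipped with the toolkit: check_prerequisites is a hypothetical name, and the index check relies on 'samtools faidx' writing its index next to the reference as <fasta-file>.fai.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of SVToolkit). Returns 0 only when
# Rscript is on the PATH and the reference has a .fai index next to it.
# Usage: check_prerequisites <fasta-file>
check_prerequisites() {
    reference=$1
    status=0
    # 'command -v' is the portable way to test whether a program is on the PATH.
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        status=1
    fi
    # 'samtools faidx ref.fasta' writes its index to ref.fasta.fai.
    if [ ! -f "$reference.fai" ]; then
        echo "missing index $reference.fai (try: samtools faidx $reference)" >&2
        status=1
    fi
    return "$status"
}

# Typical use before launching the walker:
#   check_prerequisites Homo_sapiens_assembly18.fasta || exit 1
```

The function reports every problem it finds rather than stopping at the first, so a single run shows everything that needs fixing.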