Detailed information about command line options for BaseRecalibrator can be found here.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.
Detailed information about command line options for BaseRecalibrator can be found here.
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam
After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.
Effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:
This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).
Example Arguments table:
#:GATKTable:true:1:17::; #:GATKTable:Arguments:Recalibration argument collection values used in this run Argument Value covariate null default_platform null deletions_context_size 6 force_platform null insertions_context_size 6 ...
The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.
Example Arguments table:
#:GATKTable:true:2:94:::; #:GATKTable:Quantized:Quality quantization map QualityScore Count QuantizedScore 0 252 0 1 15972 1 2 553525 2 3 2190142 9 4 5369681 9 9 83645762 9 ...
This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:; #:GATKTable:RecalTable0: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors SRR032768 D 40.7476 45.0000 2642683174 222475 SRR032766 D 40.9072 45.0000 2630282426 213441 SRR032764 D 40.5931 45.0000 2919572148 254687 SRR032769 D 40.7448 45.0000 2850110574 240094 SRR032767 D 40.6820 45.0000 2820040026 241020 SRR032765 D 40.9034 45.0000 2441035052 198258 SRR032766 M 23.2573 23.7733 2630282426 12424434 SRR032768 M 23.0281 23.5366 2642683174 13159514 SRR032769 M 23.2608 23.6920 2850110574 13451898 SRR032764 M 23.2302 23.6039 2919572148 13877177 SRR032765 M 23.0271 23.5527 2441035052 12158144 SRR032767 M 23.1195 23.5852 2820040026 13750197 SRR032766 I 41.7198 45.0000 2630282426 177017 SRR032768 I 41.5682 45.0000 2642683174 184172 SRR032769 I 41.5828 45.0000 2850110574 197959 SRR032764 I 41.2958 45.0000 2919572148 216637 SRR032765 I 41.5546 45.0000 2441035052 170651 SRR032767 I 41.5192 45.0000 2820040026 198762
This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable1: ReadGroup QualityScore EventType EmpiricalQuality Observations Errors SRR032767 49 M 33.7794 9549 3 SRR032769 49 M 36.9975 5008 0 SRR032764 49 M 39.2490 8411 0 SRR032766 18 M 17.7397 16330200 274803 SRR032768 18 M 17.7922 17707920 294405 SRR032764 45 I 41.2958 2919572148 216637 SRR032765 6 M 6.0600 3401801 842765 SRR032769 45 I 41.5828 2850110574 197959 SRR032764 6 M 6.0751 4220451 1041946 SRR032767 45 I 41.5192 2820040026 198762 SRR032769 6 M 6.3481 5045533 1169748 SRR032768 16 M 15.7681 12427549 329283 SRR032766 16 M 15.8173 11799056 309110 SRR032764 16 M 15.9033 13017244 334343 SRR032769 16 M 15.8042 13817386 363078 ...
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable2: ReadGroup QualityScore CovariateValue CovariateName EventType EmpiricalQuality Observations Errors SRR032767 16 TACGGA Context M 14.2139 817 30 SRR032766 16 AACGGA Context M 14.9938 1420 44 SRR032765 16 TACGGA Context M 15.5145 711 19 SRR032768 16 AACGGA Context M 15.0133 1585 49 SRR032764 16 TACGGA Context M 14.5393 710 24 SRR032766 16 GACGGA Context M 17.9746 1379 21 SRR032768 45 CACCTC Context I 40.7907 575849 47 SRR032764 45 TACCTC Context I 43.8286 507088 20 SRR032769 45 TACGGC Context D 38.7536 37525 4 SRR032768 45 GACCTC Context I 46.0724 445275 10 SRR032766 45 CACCTC Context I 41.0696 575664 44 SRR032769 45 TACCTC Context I 43.4821 490491 21 SRR032766 45 CACGGC Context D 45.1471 65424 1 SRR032768 45 GACGGC Context D 45.3980 34657 0 SRR032767 45 TACGGC Context D 42.7663 37814 1 SRR032767 16 AACGGA Context M 15.9371 1647 41 SRR032764 16 GACGGA Context M 18.2642 1273 18 SRR032769 16 CACGGA Context M 13.0801 1442 70 SRR032765 16 GACGGA Context M 15.9934 1271 31 ...
The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
New in GATK 2.0 is the capability of UnifiedGenotyper to natively call non-diploid organisms. Three use cases are currently supported:
In order to enable this feature, users need to set the -ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I myReads.bam \
-T UnifiedGenotyper \
-ploidy 3
The glm argument works in the same way as in the diploid case - set to [INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed in the following weeks as we iron out details and fix issues in the GATK 2.0 beta.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, current InbreedingCoeff is omitted. Annotations which do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF and Genotype annotations such as PL, AD, GT, etc.
The interaction between non-diploid calling and other experimental tools like HaplotypeCaller or ReduceReads is currently not supported.
Whereas it's entirely possible to use VQSR to filter non-diploid calls, we currently have no experience with this and can hence offer no support nor best practices for this.
Only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
Users should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.
java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta
This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!
The tool, though once in the GATK, is now part of the Picard package.
The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page will provide an extremely brief explanation of how to write Javadoc; more information on writing javadoc comments can be found in Sun's documentation.
The GATK's build system uses the javadoc parser to extract the javadoc for classes and packages and embed the contents of that javadoc in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The javadoc parser will pick up on 'standard' javadoc comments, such as the following, taken from PrintReadsWalker:
/**
* This walker prints out the input reads in SAM format. Alternatively, the walker can write reads into a specified BAM file.
*/
You can add javadoc to your package by creating a special file, package-info.java, in the package directory. This file should consist of the javadoc for your package plus a package descriptor line. One such example follows:
/**
* @help.display.name Miscellaneous walkers (experimental)
*/
package org.broadinstitute.sting.playground.gatk.walkers;
Additionally, the GATK provides two extra custom tags for overriding the information that ultimately makes it into the help.
@help.display.name Changes the name of the package as it appears in help. Note that the name of the walker cannot be changed as it is required to be passed verbatim to the -T argument.
@help.summary Changes the description which appears on the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.
@help.description Changes the description which appears at the bottom of the help text with -T <your walker> --help is specified. This is useful if you'd like to present a more complete description of your walker.
Walkers can be hidden from the documentation system by adding the @Hidden annotation to the top of each walker. @Hidden walkers can still be run from the command-line, but their documentation will not be visible to end users. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.
Because the building of our help text is actually heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation.
Compile with the following command:
ant -Ddisable.help=true
to disable generation of help.
Yes.
For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.
In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.
@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil
On the command line specify the arguments by repeating the argument name.
-filter filter1 -filter filter2 -filter filter3
Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.
def script {
var myCommand = new MyFunction
myCommand.filters = this.filterNames
}
For a full example of command line arguments see the QScripts documentation.
Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.
class MySplitter extends InProcessFunction {
@Input(doc="inputs")
var in: File = _
@Output(doc="outputs")
var out: List[File] = Nil
def run {
StingUtilityMethod.quickSplitFile(in, out)
}
}
var splitter = new MySplitter
splitter.in = new File("input.txt")
splitter.out = List(new File("out1.txt"), new File("out2.txt"))
add(splitter)
See Queue CommandLineFunctions for more information on how @Input and @Output are used.
Create an instance of a ListWriterFunction and add it in your script method.
import org.broadinstitute.sting.queue.function.ListWriterFunction
</pre>
<pre>
val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")
add(writeBamList)
Queue contains a trait mixin you can use to add Log4J support to your classes.
Add the import for the trait Logging to your QScript.
import org.broadinstitute.sting.queue.util.Logging
Mixin the trait to your class.
class MyScript extends Logging {
...
Then use the mixed in logger to write debug output when the user specifies -l DEBUG.
logger.debug("This will only be displayed when debugging is enabled.")
Try ant clean.
Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.
No. QScript will create all parent directories for outputs.
Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:
RUNLIMIT
240.0 min of gsa3
Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.
The easiest way to do this at the moment is to mixin a trait.
First define a trait which adds your java options:
trait RemoteDebugging extends JavaCommandLineFunction {
override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
}
Then mix in the trait to your walker and otherwise run it as normal:
val printReadsDebug = new PrintReads with RemoteDebugging
printReadsDebug.reference_sequence = "my.fasta"
// continue setting up your walker...
add(printReadsDebug)
If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.
INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 16:25:55,140 QCommandLine - Added 4 functions
INFO 16:25:55,140 QGraph - Generating graph.
INFO 16:25:55,164 QGraph - Generating scatter gather jobs.
INFO 16:25:55,714 QGraph - Removing original jobs.
INFO 16:25:55,716 QGraph - Adding scatter gather jobs.
INFO 16:25:55,779 QGraph - Regenerating graph.
INFO 16:25:55,790 QGraph - Running jobs.
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done
INFO 16:25:55,902 QCommandLine - Done
Queue will not re-run the job if a .done file is found for the all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.
The GATK is an open source project that has greatly benefited from the contributions of outside users. The GATK team welcomes contributions from anyone who produces useful functionality in line with the goals of the toolkit. You are welcome to branch the GATK main repository and develop your own tools. Sometimes these tools may be useful to the GATK user community and you may want to make it part of the main GATK distribution. If so we ask you to follow our guidelines for submission of patches.
There are a few good GIT practices that you should follow to simplify the ultimate goal, which is, adding your changes to the main GATK repository.
You should always start your code by creating a new branch from the most recent version of the main repository with :
git checkout master (make sure you are in the master branch)
git fetch && git rebase origin/master (you can substitute this line for "git pull" if you have no changes in the master branch)
git checkout -b newtool (create a new branch for your new tool)
Note: If you have submitted a patch to the group, do not continue development on the same branch as we cannot guarantee that your changes will make it to the main repository unchanged.
Every time before you rebase, you have to update your copy of the main repository. To do this use:
git fetch
If you are just trying to keep up with the changes in the main repository after a fetch, you can rebase your branch at anytime using (and this should be all you need to do):
git rebase origin/master
In case there are conflicts, resolve them as you would and do:
git rebase --continue
If you don't know how to resolve the conflicts, you can always safely abort the whole process and go back to your branch before you started rebasing:
git rebase --abort
If you are done and want to generate your patches conforming to the latest repository changes, to edit, squash and reorder your commits use :
git rebase -i origin/master
At the prompt, you can follow the instructions to squash, edit and reorder accordingly. You can also do this step from IntelliJ with a visual editor that allows you to select what to edit/squash/reorder. You can also take a look at this nice tutorial on how to use interactive rebase.
It is okay to have a list of commits (numbered) somewhat like this in your local tree:
Before you can send your tools to us, you have to organize these commits so they tell a meaningful history and are self contained. To achieve this you will need to rebase so you can squash, edit and reorder your commits. This tree makes a lot of sense for your development process, but it makes no sense in the main repository history as it becomes hard to pick/revert commits and understand the history at a glance. After rebasing, you should edit your commits to look like this:
Use your commit messages wisely to help quick processing of your patches. Make sure the first line of your commit messages have less than 50 characters (title). Add a blank line and write a paragraph or more explaining what this commit represents (now that it is a package of multiple commits. It is important to have the 50 char title because this is all we see when we look at an extended history to find bugs and it is also our quick access to remember what the commit does to the repository.
A patch should be self contained. Meaning if we decide to adopt feature X and Z but not Y, we should be able to do so by only applying patches 1 and 2. If your patches are co-dependent, you should say so in the commits and justify why you didn't squash the commits together into one tool.
To generate patches, use :
git format-patch since
The since parameter is the last commit you want to generate patches from, for example: HEAD^3 will generate patches for HEAD^2, HEAD^1 and HEAD. You can also specify the commit by its id or by using the head of a branch. This is where using branches will make your life easier. If master is always up to date with the main repo with no changes, you can do:
git format-patch master (provided your master is up to date)
This will generate a patch for each commit you've created and you can simply e-mail them as an attachment to us.
Three-stage procedure:
Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.
Take the master sites VCF and genotype each sample BAM file at these sites
(Optionally) Merge the single sample VCFs into a master VCF file
The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:
Batch 1:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
20 10000000 . T G . PASS . GT:GQ 0/1:30
20 10000117 . C T . FAIL . GT:GQ 0/1:30
20 10000211 . C T . PASS . GT:GQ 0/1:30
20 10001436 . A AGG . PASS . GT:GQ 1/1:30
Batch 2:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
20 10000117 . C T . FAIL . GT:GQ 0/1:30
20 10000211 . C T . FAIL . GT:GQ 0/1:30
20 10000598 . T A . PASS . GT:GQ 1/1:30
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30
In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:
Master VCF:
20 9999996 . A ATC . PASS . GT:GQ 0/1:30 [pass in both]
20 10000000 . T G . PASS . GT:GQ 0/1:30 [only in batch 1]
20 10000117 . C T . FAIL . GT:GQ 0/1:30 [fail in both]
20 10000211 . C T . FAIL . GT:GQ 0/1:30 [pass in 1, fail in 2, choice in unclear]
20 10000598 . T A . PASS . GT:GQ 1/1:30 [only in batch 2]
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30 [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]
These issues fall into the following categories:
There are two difficult situations that must be addressed by the needs of the project merging batches:
Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.
The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:
java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf
producing the following VCF:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO
20 9999996 . A ACT . PASS set=Intersection
20 10000000 . T G . PASS set=one
20 10000117 . C T . FAIL set=FilteredInAll
20 10000211 . C T . PASS set=filterIntwo-one
20 10000598 . T A . PASS set=two
20 10001436 . A AGG,AGGCT . PASS set=Intersection
Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.
As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:
java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none \
The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.
The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 9999996 . A ACT 4576.19 . . GT:DP:GQ:PL 1/1:76:99:4576,229,0
20 10000000 . T G 0 . . GT:DP:GQ:PL 0/0:79:99:0,238,3093
20 10000211 . C T 857.79 . . GT:AD:DP:GQ:PL 0/1:28,27:55:99:888,0,870
20 10000598 . T A 1800.57 . . GT:AD:DP:GQ:PL 1/1:0,48:48:99:1834,144,0
20 10001436 . A AGG,AGGCT 1921.12 . . GT:DP:GQ:PL 0/2:49:84.06:1960,2065,0,2695,222,84
Several things should be noted here:
This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:
foreach sample in samples:
run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
end
You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:
java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf
Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.
The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)
To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.
Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.
Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.
To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
The GATK team would love to hear about any applications within which of the GATK-Lite codebase is embedded or walkers which you have chosen to distribute. Please send an email to gsahelp to let us know!
When redistributing the GATK-Lite codebase, please abide by the terms of our copyright:
/*
* Copyright (c) 2009 The Broad Institute
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*/
The packaging tool in the Sting repository can layout packages for redistribution. Currently, only walkers checked into the GATK's git repository are well supported by the packaging system. Example packaging files can be found in $STING_HOME/packages.
Create a package xml for your project inside $STING_HOME/packages.
Key elements within the package xml include:
executable
Each occurrence of this tag will create an executable jar of the given name tag, using the main method from the given main-class tag.
main-class
This is the main class for the package. When running with java -jar YOUR_JAR.jar, main-class is the class that will be executed.
dependencies
Other dependencies can be of type class or file. If of type class, a dependency analyzer will look for all dependencies of your classes and include those files as well. File dependencies will end up in the root of your package.
resources
Supplemental files can be added to the resources section. Resource files will be copied to the resources directory within the package.
To create a package, execute the following command:
cd $STING_HOME
ant package -Dexecutable=<your executable name>
The packaging system will create a layout directory in dist/packages/<your executable>. Examine the contents of this directory. When you are happy with the results, finalize the package by running the following:
tar cvhjf <your executable>.tar.bz2 <your executable>
To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.
But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.
At the moment there are two of the standard annotations affected by pedigree:
In the specific case of trios, an additional GATK walker, PhaseByTransmission, should be used to obtain trio-aware genotypes as well as phase by descent.
The annotations mentioned above have been adapted for PED files starting with GATK v.1.6. If you already have VCF files generated by an older version of the GATK or have not passed a PED file while running the UnifiedGenotyper or VariantAnnotator, you should do the following:
-G StandardAnnotation to VariantAnnotator. Make sure you pass your PED file to the VariantAnnotator as well!The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) developed by Heng Li in late 2010. See this SamTools page for more details.
The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all.
The GATK engine accepts the argument -baq with the following enum values:
public enum CalculationMode {
OFF, // don't apply a BAQ at all, the default
CALCULATE_AS_NECESSARY, // do HMM BAQ calculation on the fly, as necessary, if there's no tag
RECALCULATE // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
}
If you want to enable BAQ, the usual thing to do is CALCULATE_AS_NECESSARY, which will calculate BAQ values if they are not in the BQ read tag. If your reads are already tagged with BQ values, then the GATK will use those. RECALCULATE will always recalculate the BAQ, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).
If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (-baqGOP) to use in the HMM. This value should be 40 by default, a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in more specific call set. We continue to play with these values some. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if -baq is provided.
For UnifiedGenotyper to get more specific SNP calls.
For PrintReads to write out a BAM file with BAQ tagged reads
For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use -baq RECALCULATE so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that both of these tools will not use the BAQ values on input, but will write out the tags for analysis tools that will use them.
Note that some tools should not have BAQ applied to them.
This last option will be a particularly useful for people who are already doing base quality score recalibration. Suppose I have a pipeline that does:
RealignerTargetCreator
IndelRealigner
CountCovariates
TableRecalibrate
UnifiedGenotyper
A highly efficient BAQ extended pipeline would look like
RealignerTargetCreator
IndelRealigner // don't bother with BAQ here, since we will calculate it in table recalibrator
CountCovariates
TableRecalibrate -baq RECALCULATE // now the reads will have a BAQ tag added. Slows the tool down some
UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags from TableRecalibrate, keeping UG fast
Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
These are the most popular Queue command line options. For a complete and up to date list run with -help. QScripts may also add additional command line options.
| Command Line Argument | Description | Default |
|---|---|---|
-run |
If passed the scripts are run. If not passed a dry run is executed. | dry run |
-jobRunner <jobrunner> |
The job runner to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell which runs jobs on a local shell one at a time. |
Shell |
-bsub |
Alias for -jobRunner Lsf706 |
not set |
-qsub |
Alias for -jobRunner GridEngine |
not set |
-status |
Prints out a summary progress. If a QScript is currently running via -run, you can run the same command line with -status instead to print a summary of progress. |
not set |
-retry <count> |
Retries a QFunction that returns a non-zero exit code up to count times. The QFunction must not have set jobRestartable to false. |
0 = no retries |
-startFromScratch |
Restarts the graph from the beginning. If not specified for each output file specified on a QFunction, ex: /path/to/output.file, Queue will not re-run the job if a .done file is found for the all the outputs, ex: /path/to/.output.file.done. |
use .done files to determine if jobs are complete |
-keepIntermediates |
By default Queue deletes the output files of QFunctions that set .isIntermediate to true. |
delete intermediate files |
-statusTo <email> |
Email address to send status to whenever a) A job fails, or b) Queue has run all the functions it can run and is exiting. | not set |
-statusFrom <email> |
Email address to send status emails from. | user@local.domain |
-dot <file> |
If set renders the job graph to a dot file. | not rendered |
-l <logging_level> |
The minimum level of logging, DEBUG, INFO, WARN, or FATAL. |
INFO |
-log <file> |
Sets the location to save log output in addition to standard out. | not set |
-debug |
Set the logging to include a lot of debugging information (SLOW!) | not set |
-jobReport |
Path to write the job report text file. If R is installed and available on the $PATH then a pdf will be generated visualizing the job report. |
jobPrefix.jobreport.txt |
-disableJobReport |
Disables writing the job report. | not set |
-help |
Lists all of the command line arguments with their descriptions. | not set |
The following options can be specified on the command line over overridden per QFunction.
| Command Line Argument | QFunction Property | Description | Default |
|---|---|---|---|
-jobPrefix |
.jobName |
The unique name of the job. Used to prefix directories and log files. Use -jobNamePrefix on the Queue command line to replace the default prefix Q-<processid>@<host>. |
<jobNamePrefix>-<jobNumber> |
| N/A | .jobOutputFile |
Captures stdout and if jobErrorFile is null it captures stderr as well. |
<jobName>.out |
| N/A | .jobErrorFile |
If not null captures stderr. | null |
| N/A | .commandDirectory |
The directory to execute the command line from. | current directory |
-jobProject |
.jobProject |
The project name for the job. | default job runner project |
-jobQueue |
.jobQueue |
The queue to dispatch the job. | default job runner queue |
-jobPriority |
.jobPriority |
The dispatch priority for the job. Lowest priority = 0. Highest priority = 100. |
default job runner priority |
-jobNative |
.jobNativeArgs |
Native args to pass to the job runner. Currently only supported in GridEngine and Drmaa. The string is concatenated to the native arguments passed over DRMAA. Example: -w n. |
none |
-jobResReq |
.jobResourceRequests |
Resource requests to pass to the job runner. On GridEngine this is multiple -l <req>. On LSF a single -R <req> is generated. |
memory reservations and limits on LSF and GridEngine |
-jobEnv |
.jobEnvironmentNames |
Predefined environment names to pass to the job runner. On GridEngine this is -pe <env>. On LSF this is -a <env>. |
none |
-memLimit |
.memoryLimit |
The memory limit for the job in gigabytes. Used to populate the variables residentLimit and residentRequest which can also be set separately. | default job runner memory limit |
-resMemLimit |
.residentLimit |
Limit for the resident memory in gigabytes. On GridEngine this is -l mem_free=<mem>. On LSF this is -R rusage[mem=<mem>]. |
memoryLimit * 1.2 |
-resMemReq |
.residentRequest |
Requested amount of resident memory in gigabytes. On GridEngine this is -l h_rss=<mem>. On LSF this is -R rusage[select=<mem>]. |
memoryLimit |
| Command Line Argument | Description | Default |
|---|---|---|
-emailHost <hostname> |
SMTP host name | localhost |
-emailPort <port> |
SMTP port | 25 |
-emailTLS |
If set uses TLS. | not set |
-emailSSL |
If set uses SSL. | not set |
-emailUser <username> |
If set along with emailPass or emailPassFile authenticates the email with this username. | not set |
-emailPassFile <file> |
If emailUser is also set authenticates the email with contents of the file. | not set |
-emailPass <password> |
If emailUser is also set authenticates the email with this password. NOT SECURE: Use emailPassFile instead! | not set |
Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.
As of July 2011 this is the currently known list of forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However we have yet to receive confirmation that Queue works on any of these setups.
Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.
If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.
Try out the Hello World example with -jobRunner GridEngine.
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run
If all goes well Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE and "hello world should be echoed into the output file, possibly with other grid engine log messages.
See QFunction and Command Line Options for more info on Queue options.
If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2
Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=2048M -l h_rss=2458M echo hello world
One other thing to check is if there is a memory limit on your cluster. For example try submitting jobs with up to 16G.
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=4096M -l h_rss=4915M echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=8192M -l h_rss=9830M echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=16384M -l h_rss=19960M echo hello world
If the above tests pass and GridEngine will still not dispatch jobs submitted by Queue please report the issue to our support forum.
We have found it that Queue works best with IntelliJ IDEA Community Edition (free) or Ultimate Edition installed with the Scala Plugin enabled. Once you have downloaded IntelliJ IDEA, follow the instructions below to setup a Sting project with Queue and the Scala Plugin.
[[File:sting_project_libraries.png|300px|thumb|right|Project Libraries]] [[File:sting_module_sources.png|300px|thumb|right|Module Sources]] [[File:sting_module_dependencies.png|300px|thumb|right|Module Dependencies]] [[File:sting_module_scala_facet.png|300px|thumb|right|Scala Facet]]
Build Queue from source from the command line with ant queue, so that:
- The lib folder is initialized including the scala jars
- The queue-extensions for the GATK are generated to the build folder
File > SettingsIDE Settings in the left navigation list select PluginsAvailable tab under pluginsscala pluginNo. The correct scala libraries and compiler are already available in the lib folder from when you built Queue from the command lineRestart IntelliJ to load the scala pluginSelect the menu File... > New Project...
On the first page of "New Project" select
Create project from scratch
Click Next >
On the second page of "New Project" select
Set the project Name: to Sting
Set the Project files location: to the directory where you checked out the Sting git repository, for example /Users/jamie/src/Sting
Uncheck Create Module
Click Finish
The "Project Structure" window should open. If not open it via the menu File > Project Structure
Under the Project Settings in the left panel of "Project Structure" select Project
Make sure that Project SDK is set to a build of 1.6
If the Project SDK only lists <No SDK> add a New > JSDK pointing to /System/Library/Frameworks/JavaVM.framework/Versions/1.6
Under the Project Settings in the left panel of "Project Structure" select Libraries
Click the plus (+) to create a new Project Library
Set the Name: to Sting/lib
Select Attach Jar Directories
Select the path to lib folder under your SVN checkout
Under the Project Settings in the left panel of "Project Structure" select Modules
Click on the + box to add a new module
On the first page of "Add Module" select
Create module from scratch
Click Next \>
On the second page of "Add Module" select
Set the module Name: to Sting
Change the Content root to: <directory where you checked out the Sting SVN repository>
Click Next \>
On the third page
Uncheck all of the other source directories only leaving the java/src directory checked
Click Next \>
On fourth page click Finish
Back in the Project Structure window, under the Module 'Sting', on the Sources tab make sure the following folders are selected
Source Folders (in blue):
public/java/src
public/scala/src
private/java/src (Broad only)
private/scala/src (Broad only)
build/queue-extensions/srcTest Source Folders (in green):
public/java/test
public/scala/test
private/java/test (Broad only)
private/scala/test (Broad only)In the Project Structure window, under the Module 'Sting', on the Module Dependencies tab select
Click on the button Add...
Select the popup menu Library...
Select the Sting/lib library
Click Add selected
Refresh the Project Structure window so that it becomes aware of the Scala library in Sting/lib
Click the OK button
Reopen Project Structure via the menu File > Project Structure
In the second panel, click on the Sting module
Click on the plus (+) button above the second panel module
In the popup menu under Facet select Scala
On the right under Facet 'Scala' set the Compiler library: to Sting/lib
Click OK
File > SettingsProject Settings [Sting] in the left navigation list select Compiler then Annotation ProcessorsEnable annotation processingobtain processors from the classpath selectedOK[[File:queue_debug.png|300px|thumb|right|Queue Remote Debug]]
In IntelliJ 10 open the menu Run > Edit Configurations.
Click the gold [+] button at the upper left to open the Add New Configuration popup menu.
Select Remote from the popup menu.
With the new configuration selected on the left, change the configuration name from 'Unnamed' to something like 'Queue Remote Debug'.
Set the Host to the hostname of your server, and the Port to an unused port. You can try the default port of 5005.
From the Use the following command line arguments for running remote JVM, copy the argument string.
On the server, paste / modify your command line to run with the previously copied text, for example java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 Queue.jar -S myscript.scala ....
If you would like the program to wait for you to attach the debugger before running, change suspend=n to suspend=y.
Back in IntelliJ, click OK to save your changes.
Queue Remote Debug is selected via the configuration drop down or Run > Edit Configurations.Run > Debug.From Stack overflow:
Point IntelliJ to http://download.oracle.com/javase/6/docs/api/.
Go to File -> Project Structure -> SDKs -> Apple 1.x -> DocumentationPaths, and the click specify URL.
In IntelliJ, open File -> Project Structure. Click on "SDKs" under "Platform Settings". Add the following path under the Sourcepath tab: /Library/Java/JavaVirtualMachines/1.6.0_29-b11-402.jdk/Contents/Home/src.jar!/src
Reads can be filtered out of traversals by either pileup size through one of our downsampling methods or by read property through our read filtering mechanism. Both techniques and described below.
Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.
The GATK's default downsampler exhibits the following properties:
The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.
By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.
From the command line:
To disable the downsampler, specify -dt NONE.
To change the default coverage per-sample, specify the desired coverage to the -dcov option.
To modify the walker's default behavior:
by=<value>. Override the downsampling depth by changing the toCoverage=<value>.The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:
For each sample:
Select
While the number of existing reads + the number of incoming reads is greater than the target sample size:
Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.
If we have n slots avaiable where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.
To selectively filter out reads before they reach your walker, implement one or multiple net.sf.picard.filter.SamRecordFilter, and attach it to your walker as follows:
@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})
You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:
public class MaxReadLengthFilter implements SamRecordFilter {
@Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
private int maxReadLength;
public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}
Adding this filter to the top of your walker using the @ReadFilters attribute will add a new required command-line argument, maxReadLength, which will filter reads > maxReadLength before your walker is called.
Note that when you specify a read filter, you need to strip the Filter part of its name off! E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:
--read_filter MaxReadLength
The --read-filter argument will allow you to apply whatever read filters you'd like to your dataset, before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add the command line parameters:
--read_filter MaxReadLength --maxReadLength 76
You can add as many filters as you like by using multiple copies of the --read_filter parameter:
--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
This script can be used for sorting an input file based on a reference.
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
sub usage {
print "\nUsage:\n";
print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";
print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print " INPUT input file to sort. If '-' is specified, \n";
print " then reads from STDIN.\n";
print " REF_DICT .fai file, or ANY file that has contigs, in the\n";
print " desired soting order, as its first column.\n";
print " --k POS : contig name is in the field POS (1-based)\n";
print " of input lines.\n\n";
exit(1);
}
my $pos = 1;
GetOptions( "k:i" => \$pos );
$pos--;
usage() if ( scalar(@ARGV) == 0 );
if ( scalar(@ARGV) != 2 ) {
print "Wrong number of arguments\n";
usage();
}
my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];
open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");
my %ref_order;
my $n = 0;
while ( <DICT> ) {
chomp;
my ($contig, $rest) = split "\t";
die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );
$ref_order{$contig} = $n;
$n++;
}
close DICT;
#we have loaded contig ordering now
my $INPUT;
if ($input_file eq "-" ) {
$INPUT = "STDIN";
} else {
open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}
my %temp_outputs;
while ( <$INPUT> ) {
my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_")
if ( $pos >= scalar(@fields) );
my $contig = $fields[$pos];
if ( $contig =~ m/:/ ) {
my @loc = split(/:/, $contig);
# print $contig . " " . $loc[0] . "\n";
$contig = $loc[0]
}
chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line
my $order;
if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
else {
$order = $n; # input line has contig that was not in the dict;
$n++; # this contig will go at the end of the output,
# after all known contigs
}
my $fhandle;
if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$ $_\n";
open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order: $!");
$temp_outputs{$order} = $fhandle;
}
# we got the handle to the temp file that keeps all
# lines with contig $contig
print $fhandle $_; # send current line to its corresponding temp file
}
close $INPUT;
foreach my $f ( values %temp_outputs ) { close $f; }
# now collect back into single output stream:
for ( my $i = 0 ; $i < $n ; $i++ ) {
# if we did not have any lines on contig $i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;
my $f;
open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
while ( <$f> ) { print ; }
close $f;
unlink "/tmp/sortByRef.$$.$i.tmp";
}
The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.
Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over know feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.
For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.

For developers there are two classes that are important:
Feature
This is the genomicly oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one based), and the stop position. The start and stop position represent a closed, one-based interval. I.e. the first base in chromosome one would be chr1:1-1.
FeatureCodec
This class takes in a line of text (from an input source, whether it's a file, compressed file, or a http link), and produces the above feature.
To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:
public interface Feature {
/**
* Return the features reference sequence name, e.g chromosome or contig
*/
public String getChr();
/**
* Return the start position in 1-based coordinates (first base is 1)
*/
public int getStart();
/**
* Return the end position following 1-based fully closed conventions. The length of a feature is
* end - start + 1;
*/
public int getEnd();
}
And the interface for FeatureCodec:
/**
* the base interface for classes that read in features.
* @param <T> The feature type this codec reads
*/
public interface FeatureCodec<T extends Feature> {
/**
* Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
*
* @param line the input line to decode
* @return Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
* a comment)
*/
public Feature decodeLoc(String line);
/**
* Decode a line as a Feature.
*
* @param line the input line to decode
* @return Return the Feature encoded by the line, or null if the line does not represent a feature (e.g. is
* a comment)
*/
public T decode(String line);
/**
* This function returns the object the codec generates. This is allowed to be Feature in the case where
* conditionally different types are generated. Be as specific as you can though.
*
* This function is used by reflections based tools, so we can know the underlying type
*
* @return the feature type this codec generates.
*/
public Class<T> getFeatureType();
/** Read and return the header, or null if there is no header.
*
* @return header object
*/
public Object readHeader(LineReader reader);
}
The following formats are supported in Tribble:
Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.
Make sure that you've checked your changes into Tribble; unversioned changes will be problematic, so you should always check in so that you have a unique version number to identify your release.
Once you've checked-in Tribble, make sure to svn update, and then run svnversion. This will give you a version number which you can use to name your release. Let's say it was 82. **If it contains an M (i.e. 82M) this means your version isn't clean (you have modifications that are not checked in), don't proceed`.
from the Tribble main directory, run ant clean, then ant (make sure it runs successfully), and ant test (also make sure it completes successfully).
copy dist/tribble-0.1.jar (or whatever the internal Tribble version currently is) to your checkout of the GATK, as the file ./settings/repository/org.broad/tribble-<svnversion>.jar.
Copy the current XML file to the new name, i.e. from the base GATK trunk directory:
cp ./settings/repository/org.broad/tribble-
Edit the ./settings/repository/org.broad/tribble-<svnversion>.xml with the new correct version number and release date (here we rev 81 to 82).
This involves changing:
<ivy-module version="1.0">
<info organisation="org.broad" module="tribble" revision="81" status="integration" publication="20100526124200" />
</ivy-module>
To:
<ivy-module version="1.0">
<info organisation="org.broad" module="tribble" revision="82" status="integration" publication="20100528123456" />
</ivy-module>
Notice the change to the revision number and the publication date.
Notice the change to the revision number and the publication date.
Remove the old files svn remove ./settings/repository/org.broad/tribble-<current_svnversion>.*
Add the new files svn add ./settings/repository/org.broad/tribble-<new_svnversion>.*
Make sure you're using the new libraries to build: remove your ant cache: rm -r ~/.ant/cache.
Run an ant clean, and then make sure to test the build with ant integrationtest and ant test.
Any check-in from the base SVN directory will now rev the Tribble version.
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
-geneList /path/to/gene/list.txt
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,
587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0,
587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0,
587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0,
589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0,
589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0,
589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at
/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt
If you supply the -geneList argument, DepthOfCoverage v3.0 will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1
SORT1 594710 238.27 594710 238.27 165 245 330
NOTCH2 3011542 357.84 3011542 357.84 222 399 >500
LMNA 563183 186.73 563183 186.73 116 187 262
NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.
Introduction
SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.
Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.
Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.
For a complete, detailed argument reference, refer to the GATK document page here.
Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.
BOB MARY LINDA
1/0:20 0/0:30 1/1:50
In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).
Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:
BOB MARY
1/0:20 0/0:30
The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.
Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like
BOB MARY
1/0:20 ./.:.
with AN=2, AC=1, AF=0.5, and DP=20.
SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:
1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().
isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.
For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:
1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.
Some VCFs may have repeated header entries with the same key name, for instance:
##fileformat=VCFv3.3
##FILTER=ABFilter,"AB > 0.75"
##FILTER=HRunFilter,"HRun > 3.0"
##FILTER=QDFilter,"QD < 5.0"
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...
Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.
For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.
2 SNPs with significant strand bias

Several SNPs with excessive coverage

For a complete, detailed argument reference, refer to the GATK document page here.
In addition to true variation, variant callers emit a number of false-positives. Some of these false-positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls as preparation for executing these tests.
Description of the haplotype score annotation

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.
Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. When running the tool we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUAL is a key: the name of the annotation we want to look at30.0 is a value: the threshold that we want to use to evaluate variant quality against> is an operator: it determines which "side" of the threshold we want to selectThe complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
where && is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
where || is the logical "OR".
It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.
For example, looking again at that last expression:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
When run against a VCF record with INFO field
QD=10.0;FS=300.0;ReadPosRankSum=-10.0
it will evaluate to TRUE because the FS value is greater than 200.0.
But when run against a VCF record with INFO field
QD=10.0;FS=300.0
it will evaluate to FALSE because there is no ReadPosRankSum value defined at all and JEXL fails to evaluate it.
This means that when you're trying to filter out records with VariantFiltration, for example, the previous record would be marked as PASSing, even though it contains a bad FS value.
For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions: "QD < 2.0", "ReadPosRankSum < -20.0", and "FS > 200.0". This way, although the ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the FS value is greater than 200.0.
Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
VariantEval accepts two types of modules: stratification and evaluation modules.
CpG is a three-state stratification:
A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.
Sample is an N-state stratification, where N is the number of samples in the eval files.
Filter is a three-state stratification:
FunctionalClass is a four-state stratification:
CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
Degeneracy is a six-state stratification:
See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.
JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]
Novelty is a three-state stratification:
CountVariants is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per loci rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of snp loci |
| nInsertions | Number of insertion |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no calls loci |
| nHets | Number of het loci |
| nHomRef | Number of hom ref loci |
| nHomVar | Number of hom var loci |
| nSingletons | Number of singletons |
| heterozygosity | heterozygosity per locus rate |
| heterozygosityPerBp | heterozygosity per base pair |
| hetHomRatio | heterozygosity to homozygosity ratio |
| indelRate | indel rate (insertion count + deletion count) |
| indelRatePerBp | indel rate per base pair |
| deletionInsertionRatio | deletion to insertion ratio |
CompOverlap is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nEvalSNPs | number of eval SNP sites |
| nCompSNPs | number of comp SNP sites |
| novelSites | number of eval sites outside of comp sites |
| nVariantsAtComp | number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | percentage of eval sites at comp sites |
| nConcordant | number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | the concordance rate |
A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):
$ grep -v '##' dbsnp.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132
Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:
$ grep -v '##' eval_correct_allele.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:
$ grep -v '##' eval_incorrect_allele.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Running VariantEval with just the CompOverlap module:
$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
-L 1:10327 \
-B:dbsnp,VCF dbsnp.vcf \
-B:eval_correct_allele,VCF eval_correct_allele.vcf \
-B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
-noEV \
-EV CompOverlap \
-o eval.table
We find that the eval.table file contains the following:
$ grep -v '##' eval.table | column -t
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000
CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000
CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000
As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
TiTvVariantEvaluator is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nTi | number of transition loci |
| nTv | number of transversion loci |
| tiTvRatio | the transition to transversion ratio |
| nTiInComp | number of comp transition sites |
| nTvInComp | number of comp transversion sites |
| TiTvRatioStandard | the transition to transversion ratio for comp sites |
Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.
The information provided by the phone-home feature is critical in driving improvements to the GATK
Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
<GATK-run-report>
<id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>
<exception>
<message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<cause>
<message>Position: '10,000,001x' contains invalid chars.</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<is-user-exception>false</is-user-exception>
</cause>
<is-user-exception>true</is-user-exception>
</exception>
<start-time>2012/03/10 20.19.52</start-time>
<end-time>2012/03/10 20.19.52</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>0</iterations>
</GATK-run-report>
Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.
The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.
At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.
Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.
Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.
For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.
To obtain a GATK key, please fill out the request form.
Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:
java -jar dist/GenomeAnalysisTK.jar \
-T PrintReads \
-I public/testdata/exampleBAM.bam \
-R public/testdata/exampleFASTA.fasta \
-et NO_ET \
-K your.key
The -K argument is only necessary when running the GATK with the NO_ET option.
If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email '''gsahelp@broadinstitute.org''' to ask for a replacement key.
If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.
We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.
A GATKReport is simply a text document that contains well-formatted, easy to read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.
Here's a simple example:
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle errorrate.61PA8.7 qualavg.61PA8.7
0 7.451835696110506E-3 25.474613284804366
1 2.362777171937477E-3 29.844949954504095
2 9.087604507451836E-4 32.875909752547310
3 5.452562704471102E-4 34.498999090081895
4 9.087604507451836E-4 35.148316651501370
5 5.452562704471102E-4 36.072234352256190
6 5.452562704471102E-4 36.121724890829700
7 5.452562704471102E-4 36.191048034934500
8 5.452562704471102E-4 36.003457059679770
#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key column
1:1000 T
1:1001 A
1:1002 C
This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.
We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are the five simple steps to getting gsalib, installing it and loading a report.
Please visit the Downloads page for instructions.
gsalib library$ ant gsalib
Buildfile: build.xml
gsalib:
[exec] * installing *source* package ?gsalib? ...
[exec] ** R
[exec] ** data
[exec] ** preparing package for lazy loading
[exec] ** help
[exec] *** installing help indices
[exec] ** building package indices ...
[exec] ** testing if installed package can be loaded
[exec]
[exec] * DONE (gsalib)
BUILD SUCCESSFUL
$ cat .Rprofile
.libPaths("/path/to/Sting/R/")
$ R
R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(gsalib)
> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
Length Class Mode
CountVariants 27 data.frame list
CompOverlap 13 data.frame list
We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):
This is better data to work with than the original DePristo et al. BAMs files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.
Here's the rough library properties of the BAMs:

These data files can be downloaded from the 1000 Genomes DCC
Here are the datasets we used in the GATK paper cited below.
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.
Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/
FILTER field status-- targets used in the analysis of the exome capture dataPlease note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.
Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.
If you are conducting a review today, we would recommend that the newest version of the GATK, which performs much better than the version described in the paper. Moreover, we would also recommend one use the newest version of Crossbow as well, in case they have improved things. The GATK calls for NA12878 from the paper (above) will give one a good idea what a good call set looks like whole-genome or whole-exome.
The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.
Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.
If you want the raw, machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from SAM to FASTQ using the Picard tool SamToFastq.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue in our support forum you will first need to do a little investigation on your own.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.
The GATKDocs are what we call "Technical Documentation" in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.
The best place to look for example documentation for a GATK walker is GATKDocsExample walker in org.broadinstitute.sting.gatk.examples. This is available here.
Below is the reproduction of that file from August 11, 2011:
/**
* [Short one sentence description of this walker]
*
* <p>
* [Functionality of this walker]
* </p>
*
* <h2>Input</h2>
* <p>
* [Input description]
* </p>
*
* <h2>Output</h2>
* <p>
* [Output description]
* </p>
*
* <h2>Examples</h2>
* PRE-TAG
* java
* -jar GenomeAnalysisTK.jar
* -T $WalkerName
* PRE-TAG
*
* @category Walker Category
* @author Your Name
* @since Date created
*/
public class GATKDocsExample extends RodWalker<Integer, Integer> {
/**
* Put detailed documentation about the argument here. No need to duplicate the summary information
* in doc annotation field, as that will be added before this text in the documentation page.
*
* Notes:
* <ul>
* <li>This field can contain HTML as a normal javadoc</li>
* <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
* <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
* docs here will just result in user posts on the forum</li>
* </ul>
*/
@Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
private boolean myWalkerArgument = false;
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
public Integer reduceInit() { return 0; }
public Integer reduce(Integer value, Integer sum) { return value + sum; }
public void onTraversalDone(Integer result) { }
}
Note that the -B flag referred to below is deprecated; these docs need to be updated
The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files is that:
They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.
The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.
The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.
Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.
RMD files are extremely easy to get into the GATK using the -B syntax:
java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf
In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.
In general, you can provide as many RMD bindings to the GATK as you like:
java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf
Works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.
In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:
public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}
From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.
Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);
Here's a totally general code snippet from PileupWalker.java. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.
private String getReferenceOrderedData( RefMetaDataTracker tracker ) {
ArrayList
DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);
if ( dbsnp != null)
rodString += DbSNPHelper.toMediumString(dbsnp);
if ( !rodString.equals("") )
rodString = "[ROD: " + rodString + "]";
return rodString;
}
Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.
Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec class. Name resolution occurs using the -B type parameter, i.e. if the user specified:
-B:calls1,VCF calls1.vcf
The GATK looks for a FeatureCodec called VCFCodec.java to decode the record type. Alternately, if the user specified:
-B:calls1,MYAwesomeFormat calls1.maft
THe GATK would look for a codec called MYAwesomeFormatCodec.java. This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.
Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organisms, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence. Good luck!
Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
| Tool | dbSNP 129 - | - dbSNP >132 - | - Mills indels - | - 1KG indels - | - HapMap - | - Omni |
|---|---|---|---|---|---|---|
| RealignerTargetCreator | X | X | ||||
| IndelRealigner | X | X | ||||
| BaseRecalibrator | X | X | X | |||
| (UnifiedGenotyper/ HaplotypeCaller) | X | |||||
| VariantRecalibrator | X | X | X | X | ||
| VariantEval | X |
These tools require known indels passed with the -known argument to function properly. We use both the following files:
This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:
These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:
This tool requires known SNPs and indels passed with the -resource argument to function properly. We use all the following files:
For best results, these resources should be passed with these parameters:
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf
This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:
As featured in this forum question.
Two main things account for these kinds of differences, both linked to default behaviors of the tools:
In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.