Tagged with #intermediate
28 documentation articles | 0 events or announcements | 2 forum discussions


Detailed information about command line options for BaseRecalibrator can be found here.

Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.

The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.

Effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real snps, you should always perform recalibration on your bam file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems, which has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

The output of the BaseRecalibrator walker

  • A Recalibration report containing all the recalibration information for the data
  • A PDF file containing quality control plots showing the patterns of recalibration of the data
  • A temporary csv file to generate the plots (this file is automatically removed unless you provide the -k options to keep it)

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and its values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.

Example Arguments table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

Calling non-diploid organisms with UnifiedGenotyper

New in GATK 2.0 is the capability of UnifiedGenotyper to natively call non-diploid organisms. Three use cases are currently supported:

  1. Native variant calling in haploid or polyploid organisms.
  2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
  3. Pooled validation/genotyping at known sites.

In order to enable this feature, users need to set the -ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool). Note that all other UnifiedGenotyper arguments work in the same way.

A full minimal command line would look for example like

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I myReads.bam \
-T UnifiedGenotyper \
-ploidy 3 

The glm argument works in the same way as in the diploid case - set to [INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.

Current Limitations

Many of these limitations will be gradually removed in the following weeks as we iron out details and fix issues in the GATK 2.0 beta.

  • Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.

  • Some annotations do not work in non-diploid cases. In particular, current InbreedingCoeff is omitted. Annotations which do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF and Genotype annotations such as PL, AD, GT, etc.

  • The interaction between non-diploid calling and other experimental tools like HaplotypeCaller or ReduceReads is currently not supported.

  • Whereas it's entirely possible to use VQSR to filter non-diploid calls, we currently have no experience with this and can hence offer no support nor best practices for this.

  • Only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.

Users should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.

ReorderSam

The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.

java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta

This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!

The tool, though once in the GATK, is now part of the Picard package.

The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page will provide an extremely brief explanation of how to write Javadoc; more information on writing javadoc comments can be found in Sun's documentation.

1. Adding walker and package descriptions to the help text

The GATK's build system uses the javadoc parser to extract the javadoc for classes and packages and embed the contents of that javadoc in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The javadoc parser will pick up on 'standard' javadoc comments, such as the following, taken from PrintReadsWalker:

/**
 * This walker prints out the input reads in SAM format.  Alternatively, the walker can write reads into a specified BAM file.
 */

You can add javadoc to your package by creating a special file, package-info.java, in the package directory. This file should consist of the javadoc for your package plus a package descriptor line. One such example follows:

/**
 * @help.display.name Miscellaneous walkers (experimental)
 */
package org.broadinstitute.sting.playground.gatk.walkers;

Additionally, the GATK provides two extra custom tags for overriding the information that ultimately makes it into the help.

  • @help.display.name Changes the name of the package as it appears in help. Note that the name of the walker cannot be changed as it is required to be passed verbatim to the -T argument.

  • @help.summary Changes the description which appears on the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.

  • @help.description Changes the description which appears at the bottom of the help text with -T <your walker> --help is specified. This is useful if you'd like to present a more complete description of your walker.

2. Hiding experimental walkers (use sparingly, please!)

Walkers can be hidden from the documentation system by adding the @Hidden annotation to the top of each walker. @Hidden walkers can still be run from the command-line, but their documentation will not be visible to end users. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

3. Disabling building of help

Because the building of our help text is actually heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation.

Compile with the following command:

ant -Ddisable.help=true

to disable generation of help.

1. Many of my GATK functions are setup with the same Reference, Intervals, etc. Is there a quick way to reuse these values for the different analyses in my pipeline?

Yes.

  • Create a trait that extends from CommandLineGATK.
  • In the trait, copy common values from your qscript.
  • Mix the trait into instances of your classes.

For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.

2. How do I accept a list of arguments to my QScript?

In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.

@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil

On the command line specify the arguments by repeating the argument name.

-filter filter1 -filter filter2 -filter filter3

Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.

  def script {
     var myCommand = new MyFunction
     myCommand.filters = this.filterNames
  }

For a full example of command line arguments see the QScripts documentation.

3. What is the best way to run a utility method at the right time?

Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.

    class MySplitter extends InProcessFunction {
      @Input(doc="inputs")
      var in: File = _

      @Output(doc="outputs")
      var out: List[File] = Nil

      def run {
         StingUtilityMethod.quickSplitFile(in, out)
      }
    }

    var splitter = new MySplitter
    splitter.in = new File("input.txt")
    splitter.out = List(new File("out1.txt"), new File("out2.txt"))
    add(splitter)

See Queue CommandLineFunctions for more information on how @Input and @Output are used.

4. What is the best way to write a list of files?

Create an instance of a ListWriterFunction and add it in your script method.

import org.broadinstitute.sting.queue.function.ListWriterFunction
</pre>

<pre>
    val writeBamList = new ListWriterFunction
    writeBamList.inputFiles = bamFiles
    writeBamList.listFile = new File("myBams.list")
    add(writeBamList)

5. How do I add optional debug output to my QScript?

Queue contains a trait mixin you can use to add Log4J support to your classes.

Add the import for the trait Logging to your QScript.

import org.broadinstitute.sting.queue.util.Logging

Mixin the trait to your class.

class MyScript extends Logging {
...

Then use the mixed in logger to write debug output when the user specifies -l DEBUG.

logger.debug("This will only be displayed when debugging is enabled.")

6. I updated Queue and now I'm getting java.lang.NoClassDefFoundError / java.lang.AbstractMethodError

Try ant clean.

Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.

7. Do I need to create directories in my QScript?

No. QScript will create all parent directories for outputs.

8. How do I specify the -W 240 for the LSF hour queue at the Broad?

Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:

RUNLIMIT
240.0 min of gsa3

9. Can I run Queue with GridEngine?

Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.

10. How do I pass advanced java arguments to my GATK commands, such as remote debugging?

The easiest way to do this at the moment is to mixin a trait.

First define a trait which adds your java options:

  trait RemoteDebugging extends JavaCommandLineFunction {
    override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
  }

Then mix in the trait to your walker and otherwise run it as normal:

  val printReadsDebug = new PrintReads with RemoteDebugging
  printReadsDebug.reference_sequence = "my.fasta"
  // continue setting up your walker...
  add(printReadsDebug)

11. Why does Queue log "Running jobs. ... Done." but doesn't actually run anything?

If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.

INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper 
INFO 16:25:55,140 QCommandLine - Added 4 functions 
INFO 16:25:55,140 QGraph - Generating graph. 
INFO 16:25:55,164 QGraph - Generating scatter gather jobs. 
INFO 16:25:55,714 QGraph - Removing original jobs. 
INFO 16:25:55,716 QGraph - Adding scatter gather jobs. 
INFO 16:25:55,779 QGraph - Regenerating graph. 
INFO 16:25:55,790 QGraph - Running jobs. 
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done 
INFO 16:25:55,902 QCommandLine - Done 

Queue will not re-run the job if a .done file is found for the all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.

The GATK is an open source project that has greatly benefited from the contributions of outside users. The GATK team welcomes contributions from anyone who produces useful functionality in line with the goals of the toolkit. You are welcome to branch the GATK main repository and develop your own tools. Sometimes these tools may be useful to the GATK user community and you may want to make it part of the main GATK distribution. If so we ask you to follow our guidelines for submission of patches.

1. Good practices

There are a few good GIT practices that you should follow to simplify the ultimate goal, which is, adding your changes to the main GATK repository.

  • Use branches.
    Every time you start new work that you are going to submit to the GATK team later, do it in a new branch. Make it a habit as this will simplify many of the following procedures and allow your master branch to always be a fresh (up to date) copy of the GATK main repository. Take a look on [[#How to create a new submission| how to create a new branch for submission]].
  • Never merge.
    Merging creates a branched history with multiple parent nodes that make history hard to understand, impossible to modify and patches near-impossible to create. Merges are very useful when you need to combine multiple repositories and it should ''only'' be used when it makes sense. This means '''never merge''' and '''never pull''' (if it's not a fast-forward, or you will create a merge).
  • Commit as often as possible.
    Every change, should be committed to make sure you can go back in time effectively in your own tree. The commit messages don't matter to us as long as they're meaningful to you in this stage. You can essentially do whatever you want in your local tree with your commits, as long as you don't merge.
  • Rebase constantly
    Your branch is diverging from the master by the minute, so if you keep rebasing as often as you can, you will avoid major conflicts when it's time to send the patches. Take a look at our guide on [[#How to rebase | how to rebase]].
  • Tell a meaningful story
    When it's time to submit your patches to us, reorder your commits and write meaningful commit messages. Each commit must be (as much as possible) self contained. These commits must tell a meaningful story to us so we can understand what it is you're adding to the codebase. Take a look at an [[#How to make your commits | example commit scenario]].
  • Generate patches and email them to the group
    This part is super easy, provided you've followed the good practices. You just have to [[#How to generate the patches | generate the patches]] and e-mail them to gsa-patches@broadinstitute.org.

2. How to create a new submission

You should always start your code by creating a new branch from the most recent version of the main repository with :

git checkout master                       (make sure you are in the master branch)
git fetch && git rebase origin/master     (you can substitute this line for "git pull" if you have no changes in the master branch) 
git checkout -b newtool                   (create a new branch for your new tool)

Note: If you have submitted a patch to the group, do not continue development on the same branch as we cannot guarantee that your changes will make it to the main repository unchanged.

3. How to rebase

Every time before you rebase, you have to update your copy of the main repository. To do this use:

git fetch

If you are just trying to keep up with the changes in the main repository after a fetch, you can rebase your branch at anytime using (and this should be all you need to do):

git rebase origin/master

In case there are conflicts, resolve them as you would and do:

git rebase --continue

If you don't know how to resolve the conflicts, you can always safely abort the whole process and go back to your branch before you started rebasing:

git rebase --abort

If you are done and want to generate your patches conforming to the latest repository changes, to edit, squash and reorder your commits use :

git rebase -i origin/master

At the prompt, you can follow the instructions to squash, edit and reorder accordingly. You can also do this step from IntelliJ with a visual editor that allows you to select what to edit/squash/reorder. You can also take a look at this nice tutorial on how to use interactive rebase.

4. How to make your commits

It is okay to have a list of commits (numbered) somewhat like this in your local tree:

  • added function X
  • fixed a b and c on X
  • b was actually d
  • started creating feature Y but had to go to the bathroom
  • added Y
  • found bug in X, fixed with e
  • added Z
  • fixed bug in Z with f

Before you can send your tools to us, you have to organize these commits so they tell a meaningful history and are self contained. To achieve this you will need to rebase so you can squash, edit and reorder your commits. This tree makes a lot of sense for your development process, but it makes no sense in the main repository history as it becomes hard to pick/revert commits and understand the history at a glance. After rebasing, you should edit your commits to look like this:

  • added X (including commits 2, 3 and 6)
  • added Y (including commits 4 and 5)
  • added Z (including commits 7 and 8)

Use your commit messages wisely to help quick processing of your patches. Make sure the first line of your commit messages have less than 50 characters (title). Add a blank line and write a paragraph or more explaining what this commit represents (now that it is a package of multiple commits. It is important to have the 50 char title because this is all we see when we look at an extended history to find bugs and it is also our quick access to remember what the commit does to the repository.

A patch should be self contained. Meaning if we decide to adopt feature X and Z but not Y, we should be able to do so by only applying patches 1 and 2. If your patches are co-dependent, you should say so in the commits and justify why you didn't squash the commits together into one tool.

5. How to generate the patches

To generate patches, use :

git format-patch since  

The since parameter is the last commit you want to generate patches from, for example: HEAD^3 will generate patches for HEAD^2, HEAD^1 and HEAD. You can also specify the commit by its id or by using the head of a branch. This is where using branches will make your life easier. If master is always up to date with the main repo with no changes, you can do:

git format-patch master   (provided your master is up to date) 

This will generate a patch for each commit you've created and you can simply e-mail them as an attachment to us.

Introduction

Three-stage procedure:

  • Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.

  • Take the master sites VCF and genotype each sample BAM file at these sites

  • (Optionally) Merge the single sample VCFs into a master VCF file

Creating the master set of sites: SNPs and Indels

The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:

Batch 1:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       PASS    .       GT:GQ   0/1:30
20      10001436        .       A       AGG     .       PASS    .       GT:GQ   1/1:30

Batch 2:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30

In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:

Master VCF:

20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30  [pass in both]
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30  [only in batch 1]
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [fail in both]
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [pass in 1, fail in 2, choice in unclear]
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30  [only in batch 2]
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30  [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]

These issues fall into the following categories:

  • For sites present in all VCFs (20:9999996 above), the alleles agree, and each site PASS is pass, this site can obviously be considered "PASS" in the master VCF
  • Some sites may be PASS in one batch, but absent in others (20:10000000 and 20:10000598), which occurs when the site is polymorphic in one batch but all samples are reference or no-called in the other batch
  • Similarly, sites that are fail in all batches in which they occur can be safely filtered out, or included as failing filters in the master VCF (20:10000117)

There are two difficult situations that must be addressed by the needs of the project merging batches:

  • Some sites may be PASS in some batches but FAIL in others. This might indicate that either:
  • The site is actually truly polymorphic, but due to limited coverage, poor sequencing, or other issues it is flag as unreliable in some batches. In these cases, it makes sense to include the site
  • The site is actually a common machine artifact, but just happened to escape standard filtering in a few batches. In these cases, you would obviously like to filter out the site
  • Even more complicated, it is possible that in the PASS batches you have found a reliable allele (C/T, for example) while in others there's no alt allele but actually a low-frequency error, which is flagged as failing. Ideally, here you could filter out the failing allele from the FAIL batches, and keep the pass ones
  • Some sites may have multiple segregating alleles in each batch. Such sites are often errors, but in some cases may be actual multi-allelic sites, in particular for indels.

Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.

The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:

java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf

producing the following VCF:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      9999996     .       A       ACT         .       PASS    set=Intersection
20      10000000        .       T       G           .   PASS    set=one
20      10000117        .       C       T           .       FAIL    set=FilteredInAll
20      10000211        .       C       T           .       PASS    set=filterIntwo-one
20      10000598        .       T       A           .       PASS    set=two
20      10001436        .       A       AGG,AGGCT       .       PASS    set=Intersection

Genotyping your samples at these sites

Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.

As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:

java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none \

The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.

The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ACT         4576.19 .       .   GT:DP:GQ:PL     1/1:76:99:4576,229,0
20      10000000        .       T       G           0       .       .       GT:DP:GQ:PL     0/0:79:99:0,238,3093
20      10000211        .       C       T       857.79  .       .   GT:AD:DP:GQ:PL  0/1:28,27:55:99:888,0,870
20      10000598        .       T       A           1800.57 .       .   GT:AD:DP:GQ:PL  1/1:0,48:48:99:1834,144,0
20      10001436        .       A       AGG,AGGCT       1921.12 .       .   GT:DP:GQ:PL     0/2:49:84.06:1960,2065,0,2695,222,84

Several things should be noted here:

  • The genotype likelihoods calculation evolves, especially for indels, the exact results of this command will change.
  • The command will emit sites that are hom-ref in the sample at the site, but the -stand_call_conf 0.0 argument should be provided so that they aren't tagged as "LowQual" by the UnifiedGenotyper.
  • The filtered site 10000117 in the master.vcf is not genotyped by the UG, as it doesn't pass filters and so is considered bad by the GATK UG. If you want to determine the genotypes for all sites, independent on filtering, you must unfilter all of your records in master.vcf, and if desired, restore the filter string for these records later.

This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:

foreach sample in samples:
  run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
end

(Optional) Merging the sample VCFs together

You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:

java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf

General notes

  • Because the GATK uses dynamic downsampling of reads, it is possible for truly marginal calls to change likelihoods from discovery (processing the BAM incrementally) vs. genotyping (jumping into the BAM). Consequently, do not be surprised to see minor differences in the genotypes for samples from discovery and genotyping.
  • More advanced users may want to consider group several samples together for genotyping. For example, 100 samples could be genotyped in 10 groups of 10 samples, resulting in only 10 VCF files. Merging the 10 VCF files may be faster (or just easier to manage) than 1000 individual VCFs.
  • Sometimes, using this method, a monomorphic site within a batch will be identified as polymorphic in one or more samples within that same batch. This is because the UnifiedGenotyper applies a frequency prior to determine whether a site is likely to be monomorphic. If the site is monomorphic, it is either not output, or if EMIT_ALL_SITES is thrown, reference genotypes are output. If the site is determined to be polymorphic, genotypes are assigned greedily (as of GATK-v1.4). Calling single-sample reduces the effect of the prior, so sites which were considered monomorphic within a batch could be considered polymorphic within a sub-batch.

Introduction

Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)

To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

1. Redistributing the GATK-Lite or distributing walkers

The GATK team would love to hear about any applications within which of the GATK-Lite codebase is embedded or walkers which you have chosen to distribute. Please send an email to gsahelp to let us know!

When redistributing the GATK-Lite codebase, please abide by the terms of our copyright:

/*
 * Copyright (c) 2009 The Broad Institute
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 */

2. Packaging walkers

The packaging tool in the Sting repository can layout packages for redistribution. Currently, only walkers checked into the GATK's git repository are well supported by the packaging system. Example packaging files can be found in $STING_HOME/packages.

3. Defining a package

Create a package xml for your project inside $STING_HOME/packages.

Key elements within the package xml include:

  • executable

    Each occurrence of this tag will create an executable jar of the given name tag, using the main method from the given main-class tag.

  • main-class

    This is the main class for the package. When running with java -jar YOUR_JAR.jar, main-class is the class that will be executed.

  • dependencies

    Other dependencies can be of type class or file. If of type class, a dependency analyzer will look for all dependencies of your classes and include those files as well. File dependencies will end up in the root of your package.

  • resources

    Supplemental files can be added to the resources section. Resource files will be copied to the resources directory within the package.

3. Creating a package

To create a package, execute the following command:

cd $STING_HOME
ant package -Dexecutable=<your executable name>

The packaging system will create a layout directory in dist/packages/<your executable>. Examine the contents of this directory. When you are happy with the results, finalize the package by running the following:

tar cvhjf <your executable>.tar.bz2 <your executable>

Workflow

To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.

But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.

At the moment there are two of the standard annotations affected by pedigree:

  • Allele Frequency (computed on founders only)
  • Inbreeding coefficient (computed on founders only)

Trio Analysis

In the specific case of trios, an additional GATK walker, PhaseByTransmission, should be used to obtain trio-aware genotypes as well as phase by descent.

Important note

The annotations mentioned above have been adapted for PED files starting with GATK v.1.6. If you already have VCF files generated by an older version of the GATK or have not passed a PED file while running the UnifiedGenotyper or VariantAnnotator, you should do the following:

  • Run the latest version of the VariantAnnotator to re-annotate your variants.
  • Re-annotate all the standard annotations by passing the argument -G StandardAnnotation to VariantAnnotator. Make sure you pass your PED file to the VariantAnnotator as well!
  • If you are using Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as an annotation in your model, you should re-run VQSR once the InbreedingCoefficient is updated.

PED files

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.

1. Introduction

The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) developed by Heng Li in late 2010. See this SamTools page for more details.

2. Using BAQ

The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all.

The GATK engine accepts the argument -baq with the following enum values:

public enum CalculationMode {
    OFF,                        // don't apply a BAQ at all, the default
    CALCULATE_AS_NECESSARY,     // do HMM BAQ calculation on the fly, as necessary, if there's no tag
    RECALCULATE                 // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
}

If you want to enable BAQ, the usual thing to do is CALCULATE_AS_NECESSARY, which will calculate BAQ values if they are not in the BQ read tag. If your reads are already tagged with BQ values, then the GATK will use those. RECALCULATE will always recalculate the BAQ, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).

If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (-baqGOP) to use in the HMM. This value should be 40 by default, a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in more specific call set. We continue to play with these values some. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if -baq is provided.

3. Some example uses of the BAQ in the GATK

  • For UnifiedGenotyper to get more specific SNP calls.

  • For PrintReads to write out a BAM file with BAQ tagged reads

  • For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use -baq RECALCULATE so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that both of these tools will not use the BAQ values on input, but will write out the tags for analysis tools that will use them.

Note that some tools should not have BAQ applied to them.

This last option will be a particularly useful for people who are already doing base quality score recalibration. Suppose I have a pipeline that does:

RealignerTargetCreator
IndelRealigner

CountCovariates
TableRecalibrate

UnifiedGenotyper

A highly efficient BAQ extended pipeline would look like

RealignerTargetCreator
IndelRealigner // don't bother with BAQ here, since we will calculate it in table recalibrator

CountCovariates
TableRecalibrate -baq RECALCULATE // now the reads will have a BAQ tag added.  Slows the tool down some

UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags from TableRecalibrate, keeping UG fast

4. BAQ and walker control

Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.

These are the most popular Queue command line options. For a complete and up to date list run with -help. QScripts may also add additional command line options.

1. Queue Command Line Options

Command Line Argument Description Default
-run If passed the scripts are run. If not passed a dry run is executed. dry run
-jobRunner <jobrunner> The job runner to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell which runs jobs on a local shell one at a time. Shell
-bsub Alias for -jobRunner Lsf706 not set
-qsub Alias for -jobRunner GridEngine not set
-status Prints out a summary progress. If a QScript is currently running via -run, you can run the same command line with -status instead to print a summary of progress. not set
-retry <count> Retries a QFunction that returns a non-zero exit code up to count times. The QFunction must not have set jobRestartable to false. 0 = no retries
-startFromScratch Restarts the graph from the beginning. If not specified for each output file specified on a QFunction, ex: /path/to/output.file, Queue will not re-run the job if a .done file is found for the all the outputs, ex: /path/to/.output.file.done. use .done files to determine if jobs are complete
-keepIntermediates By default Queue deletes the output files of QFunctions that set .isIntermediate to true. delete intermediate files
-statusTo <email> Email address to send status to whenever a) A job fails, or b) Queue has run all the functions it can run and is exiting. not set
-statusFrom <email> Email address to send status emails from. user@local.domain
-dot <file> If set renders the job graph to a dot file. not rendered
-l <logging_level> The minimum level of logging, DEBUG, INFO, WARN, or FATAL. INFO
-log <file> Sets the location to save log output in addition to standard out. not set
-debug Set the logging to include a lot of debugging information (SLOW!) not set
-jobReport Path to write the job report text file. If R is installed and available on the $PATH then a pdf will be generated visualizing the job report. jobPrefix.jobreport.txt
-disableJobReport Disables writing the job report. not set
-help Lists all of the command line arguments with their descriptions. not set

2. QFunction Options

The following options can be specified on the command line over overridden per QFunction.

Command Line Argument QFunction Property Description Default
-jobPrefix .jobName The unique name of the job. Used to prefix directories and log files. Use -jobNamePrefix on the Queue command line to replace the default prefix Q-<processid>@<host>. <jobNamePrefix>-<jobNumber>
N/A .jobOutputFile Captures stdout and if jobErrorFile is null it captures stderr as well. <jobName>.out
N/A .jobErrorFile If not null captures stderr. null
N/A .commandDirectory The directory to execute the command line from. current directory
-jobProject .jobProject The project name for the job. default job runner project
-jobQueue .jobQueue The queue to dispatch the job. default job runner queue
-jobPriority .jobPriority The dispatch priority for the job. Lowest priority = 0. Highest priority = 100. default job runner priority
-jobNative .jobNativeArgs Native args to pass to the job runner. Currently only supported in GridEngine and Drmaa. The string is concatenated to the native arguments passed over DRMAA. Example: -w n. none
-jobResReq .jobResourceRequests Resource requests to pass to the job runner. On GridEngine this is multiple -l <req>. On LSF a single -R <req> is generated. memory reservations and limits on LSF and GridEngine
-jobEnv .jobEnvironmentNames Predefined environment names to pass to the job runner. On GridEngine this is -pe <env>. On LSF this is -a <env>. none
-memLimit .memoryLimit The memory limit for the job in gigabytes. Used to populate the variables residentLimit and residentRequest which can also be set separately. default job runner memory limit
-resMemLimit .residentLimit Limit for the resident memory in gigabytes. On GridEngine this is -l mem_free=<mem>. On LSF this is -R rusage[mem=<mem>]. memoryLimit * 1.2
-resMemReq .residentRequest Requested amount of resident memory in gigabytes. On GridEngine this is -l h_rss=<mem>. On LSF this is -R rusage[select=<mem>]. memoryLimit

3. Email Status Options

Command Line Argument Description Default
-emailHost <hostname> SMTP host name localhost
-emailPort <port> SMTP port 25
-emailTLS If set uses TLS. not set
-emailSSL If set uses SSL. not set
-emailUser <username> If set along with emailPass or emailPassFile authenticates the email with this username. not set
-emailPassFile <file> If emailUser is also set authenticates the email with contents of the file. not set
-emailPass <password> If emailUser is also set authenticates the email with this password. NOT SECURE: Use emailPassFile instead! not set

1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

As of July 2011 this is the currently known list of forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.

2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE and "hello world should be echoed into the output file, possibly with other grid engine log messages.

See QFunction and Command Line Options for more info on Queue options.

3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=2048M -l h_rss=2458M echo hello world

One other thing to check is if there is a memory limit on your cluster. For example try submitting jobs with up to 16G.

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=16384M -l h_rss=19960M echo hello world

If the above tests pass and GridEngine will still not dispatch jobs submitted by Queue please report the issue to our support forum.

We have found it that Queue works best with IntelliJ IDEA Community Edition (free) or Ultimate Edition installed with the Scala Plugin enabled. Once you have downloaded IntelliJ IDEA, follow the instructions below to setup a Sting project with Queue and the Scala Plugin.

[[File:sting_project_libraries.png|300px|thumb|right|Project Libraries]] [[File:sting_module_sources.png|300px|thumb|right|Module Sources]] [[File:sting_module_dependencies.png|300px|thumb|right|Module Dependencies]] [[File:sting_module_scala_facet.png|300px|thumb|right|Scala Facet]]

1. Build Queue on the Command Line

Build Queue from source from the command line with ant queue, so that: - The lib folder is initialized including the scala jars - The queue-extensions for the GATK are generated to the build folder

2. Add the scala plugin

  • In IntelliJ, open the menu File > Settings
  • Under the IDE Settings in the left navigation list select Plugins
  • Click on the Available tab under plugins
  • Scroll down in the list of available plugins and install the scala plugin
  • If asked to retrieve dependencies, click No. The correct scala libraries and compiler are already available in the lib folder from when you built Queue from the command line
  • Restart IntelliJ to load the scala plugin

3. Creating a new Sting Project including Queue

  • Select the menu File... > New Project...

  • On the first page of "New Project" select Create project from scratch Click Next >

  • On the second page of "New Project" select Set the project Name: to Sting Set the Project files location: to the directory where you checked out the Sting git repository, for example /Users/jamie/src/Sting Uncheck Create Module Click Finish

  • The "Project Structure" window should open. If not open it via the menu File > Project Structure

  • Under the Project Settings in the left panel of "Project Structure" select Project Make sure that Project SDK is set to a build of 1.6 If the Project SDK only lists <No SDK> add a New > JSDK pointing to /System/Library/Frameworks/JavaVM.framework/Versions/1.6

  • Under the Project Settings in the left panel of "Project Structure" select Libraries Click the plus (+) to create a new Project Library Set the Name: to Sting/lib Select Attach Jar Directories Select the path to lib folder under your SVN checkout

  • Under the Project Settings in the left panel of "Project Structure" select Modules

  • Click on the + box to add a new module

  • On the first page of "Add Module" select Create module from scratch Click Next \>

  • On the second page of "Add Module" select Set the module Name: to Sting Change the Content root to: <directory where you checked out the Sting SVN repository> Click Next \>

  • On the third page Uncheck all of the other source directories only leaving the java/src directory checked Click Next \>

  • On fourth page click Finish

  • Back in the Project Structure window, under the Module 'Sting', on the Sources tab make sure the following folders are selected

    • Source Folders (in blue): public/java/src public/scala/src private/java/src (Broad only) private/scala/src (Broad only) build/queue-extensions/src
    • Test Source Folders (in green): public/java/test public/scala/test private/java/test (Broad only) private/scala/test (Broad only)
  • In the Project Structure window, under the Module 'Sting', on the Module Dependencies tab select Click on the button Add... Select the popup menu Library... Select the Sting/lib library Click Add selected

  • Refresh the Project Structure window so that it becomes aware of the Scala library in Sting/lib Click the OK button Reopen Project Structure via the menu File > Project Structure

  • In the second panel, click on the Sting module Click on the plus (+) button above the second panel module In the popup menu under Facet select Scala On the right under Facet 'Scala' set the Compiler library: to Sting/lib Click OK

4. Enable annotation processing

  • Open the menu File > Settings
  • Under Project Settings [Sting] in the left navigation list select Compiler then Annotation Processors
  • Click to enable the checkbox Enable annotation processing
  • Leave the radio button obtain processors from the classpath selected
  • Click OK

5. Debugging Queue

Adding a Remote Configuration

[[File:queue_debug.png|300px|thumb|right|Queue Remote Debug]]

  • In IntelliJ 10 open the menu Run > Edit Configurations.

  • Click the gold [+] button at the upper left to open the Add New Configuration popup menu.

  • Select Remote from the popup menu.

  • With the new configuration selected on the left, change the configuration name from 'Unnamed' to something like 'Queue Remote Debug'.

  • Set the Host to the hostname of your server, and the Port to an unused port. You can try the default port of 5005.

  • From the Use the following command line arguments for running remote JVM, copy the argument string.

  • On the server, paste / modify your command line to run with the previously copied text, for example java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 Queue.jar -S myscript.scala ....

  • If you would like the program to wait for you to attach the debugger before running, change suspend=n to suspend=y.

  • Back in IntelliJ, click OK to save your changes.

Running with the Remote Configuration

  • Ensure Queue Remote Debug is selected via the configuration drop down or Run > Edit Configurations.
  • Set your breakpoints as you normally would in IntelliJ.
  • Start your program by running the full java path (with the above -Xdebug -Xrunjdwp ...) on the server.
  • In IntelliJ go to the Run > Debug.

6. Binding javadocs and source

From Stack overflow:

Add javadocs:

Point IntelliJ to http://download.oracle.com/javase/6/docs/api/.
Go to File -> Project Structure -> SDKs -> Apple 1.x -> DocumentationPaths, and the click specify URL.

Add sources:

In IntelliJ, open File -> Project Structure. Click on "SDKs" under "Platform Settings". Add the following path under the Sourcepath tab: /Library/Java/JavaVirtualMachines/1.6.0_29-b11-402.jdk/Contents/Home/src.jar!/src

1. Introduction

Reads can be filtered out of traversals by either pileup size through one of our downsampling methods or by read property through our read filtering mechanism. Both techniques and described below.

2. Downsampling

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Defaults

The GATK's default downsampler exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.

  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.

  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.

  • To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

  • Add the @Downsample interface to the top of your walker. Override the downsampling type by changing the by=<value>. Override the downsampling depth by changing the toCoverage=<value>.

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

  • Select reads with the next alignment start.

  • While the number of existing reads + the number of incoming reads is greater than the target sample size:

    Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

  • If we have n slots avaiable where n is >= 1, randomly select n of the incoming reads and add them to the pileup.

  • Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.

3. Read filtering

To selectively filter out reads before they reach your walker, implement one or multiple net.sf.picard.filter.SamRecordFilter, and attach it to your walker as follows:

@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})

4. Command-line arguments for read filters

You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:

public class MaxReadLengthFilter implements SamRecordFilter {
    @Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
    private int maxReadLength;

    public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}

Adding this filter to the top of your walker using the @ReadFilters attribute will add a new required command-line argument, maxReadLength, which will filter reads > maxReadLength before your walker is called.

Note that when you specify a read filter, you need to strip the Filter part of its name off! E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:

--read_filter MaxReadLength

5. Adding filters dynamically using command-line arguments

The --read-filter argument will allow you to apply whatever read filters you'd like to your dataset, before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add the command line parameters:

--read_filter MaxReadLength --maxReadLength 76

You can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead

This script can be used for sorting an input file based on a reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

    print "\nUsage:\n";
    print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

    print " Sorts lines of the input file INFILE according\n";
    print " to the reference contig order specified by the\n";
    print " reference dictionary REF_DICT (.fai file).\n";
    print " The sort is stable. If -k option is not specified,\n";
    print " it is assumed that the contig name is the first\n";
    print " field in each line.\n\n";
    print "  INPUT      input file to sort. If '-' is specified, \n";
    print "             then reads from STDIN.\n";
    print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
    print "             desired soting order, as its first column.\n";
    print "  --k POS :  contig name is in the field POS (1-based)\n";
    print "             of input lines.\n\n";

    exit(1);
}

my $pos = 1;
GetOptions( "k:i" => \$pos );

$pos--;

usage() if ( scalar(@ARGV) == 0 );

if ( scalar(@ARGV) != 2 ) {
    print "Wrong number of arguments\n";
    usage();
}

my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];


open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");

my %ref_order;

my $n = 0;
while ( <DICT> ) {
    chomp;
    my ($contig, $rest) = split "\t";
    die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );

    $ref_order{$contig} = $n;
    $n++;
}

close DICT;
#we have loaded contig ordering now

my $INPUT;
if ($input_file eq "-" ) {
    $INPUT = "STDIN";
} else {
    open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}

my %temp_outputs;

while ( <$INPUT> ) {

    my @fields = split '\s';
    die("Specified field position exceeds the number of fields:\n$_") 
        if ( $pos >= scalar(@fields) );

    my $contig = $fields[$pos];
    if ( $contig =~ m/:/ ) {
        my @loc = split(/:/, $contig);
        # print $contig . " " . $loc[0] . "\n";
        $contig = $loc[0]
    }
    chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line

    my $order;
    if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
    else {
        $order = $n; # input line has contig that was not in the dict; 
        $n++; # this contig will go at the end of the output, 
              # after all known contigs
    }

    my $fhandle;
    if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
    else {
        #print "opening $order $$ $_\n";
        open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
            die ( "Can not open temporary file $order: $!");
        $temp_outputs{$order} = $fhandle;
    }

    # we got the handle to the temp file that keeps all 
    # lines with contig $contig

    print $fhandle $_; # send current line to its corresponding temp file
}

close $INPUT;

foreach my $f ( values %temp_outputs ) { close $f; }

# now collect back into single output stream:

for ( my $i = 0 ; $i < $n ; $i++ ) {
    # if we did not have any lines on contig $i, then there's 
    # no temp file and nothing to do
    next if ( ! defined $temp_outputs{$i} ) ; 

    my $f; 
    open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
    while ( <$f> ) { print ; }
    close $f;

    unlink "/tmp/sortByRef.$$.$i.tmp";


}

1. Overview

The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.

2. Architecture Overview

Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over know feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.

3. Developer Overview

For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.

For developers there are two classes that are important:

  • Feature

    This is the genomicly oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one based), and the stop position. The start and stop position represent a closed, one-based interval. I.e. the first base in chromosome one would be chr1:1-1.

  • FeatureCodec

    This class takes in a line of text (from an input source, whether it's a file, compressed file, or a http link), and produces the above feature.

To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:

public interface Feature {

    /**
     * Return the features reference sequence name, e.g chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions.  The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}

And the interface for FeatureCodec:

/**
 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {
    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return  Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return  Return the Feature encoded by the line,  or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public T decode(String line);

    /**
     * This function returns the object the codec generates.  This is allowed to be Feature in the case where
     * conditionally different types are generated.  Be as specific as you can though.
     *
     * This function is used by reflections based tools, so we can know the underlying type
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();


    /**  Read and return the header, or null if there is no header.
     *
     * @return header object
     */
    public Object readHeader(LineReader reader);
}

4. Supported Formats

The following formats are supported in Tribble:

  • VCF Format
  • DbSNP Format
  • BED Format
  • GATK Interval Format

5. Updating the Tribble library

Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.

  • Make sure that you've checked your changes into Tribble; unversioned changes will be problematic, so you should always check in so that you have a unique version number to identify your release.

  • Once you've checked-in Tribble, make sure to svn update, and then run svnversion. This will give you a version number which you can use to name your release. Let's say it was 82. **If it contains an M (i.e. 82M) this means your version isn't clean (you have modifications that are not checked in), don't proceed`.

  • from the Tribble main directory, run ant clean, then ant (make sure it runs successfully), and ant test (also make sure it completes successfully).

  • copy dist/tribble-0.1.jar (or whatever the internal Tribble version currently is) to your checkout of the GATK, as the file ./settings/repository/org.broad/tribble-<svnversion>.jar.

  • Copy the current XML file to the new name, i.e. from the base GATK trunk directory:

    cp ./settings/repository/org.broad/tribble-.xml ./settings/repository/org.broad/tribble-.xml

  • Edit the ./settings/repository/org.broad/tribble-<svnversion>.xml with the new correct version number and release date (here we rev 81 to 82).

    This involves changing:

    <ivy-module version="1.0">
        <info organisation="org.broad" module="tribble" revision="81" status="integration" publication="20100526124200" />
    </ivy-module>
    

    To:

    <ivy-module version="1.0">
        <info organisation="org.broad" module="tribble" revision="82" status="integration" publication="20100528123456" />
    </ivy-module>
    

    Notice the change to the revision number and the publication date.

Notice the change to the revision number and the publication date.

  • Remove the old files svn remove ./settings/repository/org.broad/tribble-<current_svnversion>.*

  • Add the new files svn add ./settings/repository/org.broad/tribble-<new_svnversion>.*

  • Make sure you're using the new libraries to build: remove your ant cache: rm -r ~/.ant/cache.

  • Run an ant clean, and then make sure to test the build with ant integrationtest and ant test.

  • Any check-in from the base SVN directory will now rev the Tribble version.

See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

  • Total, mean, median, and quartiles for each partition type: aggregate
  • Total, mean, median, and quartiles for each partition type: for each interval
  • A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
  • A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
  • A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
  • A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

Coverage by Gene

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,

If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you supply the -geneList argument, DepthOfCoverage v3.0 will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     &gt;500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

Introduction

SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.

Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.

Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?

In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

Known issues

Some VCFs may have repeated header entries with the same key name, for instance:

##fileformat=VCFv3.3
##FILTER=ABFilter,&quot;AB &gt; 0.75&quot;
##FILTER=HRunFilter,&quot;HRun &gt; 3.0&quot;
##FILTER=QDFilter,&quot;QD &lt; 5.0&quot;
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...

Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.

Additional information

For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.

2 SNPs with significant strand bias

Several SNPs with excessive coverage

For a complete, detailed argument reference, refer to the GATK document page here.

Introduction

In addition to true variation, variant callers emit a number of false-positives. Some of these false-positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls as preparation for executing these tests.

Description of the haplotype score annotation

Examples of Available Annotations

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.

Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. When running the tool we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Important caveats

Missing annotations

It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.

For example, looking again at that last expression:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

When run against a VCF record with INFO field QD=10.0;FS=300.0;ReadPosRankSum=-10.0 it will evaluate to TRUE because the FS value is greater than 200.0.

But when run against a VCF record with INFO field QD=10.0;FS=300.0 it will evaluate to FALSE because there is no ReadPosRankSum value defined at all and JEXL fails to evaluate it.

This means that when you're trying to filter out records with VariantFiltration, for example, the previous record would be marked as PASSing, even though it contains a bad FS value.

For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions: "QD < 2.0", "ReadPosRankSum < -20.0", and "FS > 200.0". This way, although the ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the FS value is greater than 200.0.

Sensitivity to case and type

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Accessing the underlying VariantContext directly

If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

6. Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

VariantEval accepts two types of modules: stratification and evaluation modules.

  • Stratification modules will stratify (group) the variants based on certain properties.
  • Evaluation modules will compute certain metrics for the variants

CpG

CpG is a three-state stratification:

  • The locus is a CpG site ("CpG")
  • The locus is not a CpG site ("non_CpG")
  • The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.

EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

Filter

Filter is a three-state stratification:

  • The locus passes QC filters ("called")
  • The locus fails QC filters ("filtered")
  • The locus either passes or fails QC filters ("raw")

FunctionalClass

FunctionalClass is a four-state stratification:

  • The locus is a synonymous site ("silent")
  • The locus is a missense site ("missense")
  • The locus is a nonsense site ("nonsense")
  • The locus is of any functional class ("any")

CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.

Degeneracy

Degeneracy is a six-state stratification:

  • The underlying base position in the codon is 1-fold degenerate ("1-fold")
  • The underlying base position in the codon is 2-fold degenerate ("2-fold")
  • The underlying base position in the codon is 3-fold degenerate ("3-fold")
  • The underlying base position in the codon is 4-fold degenerate ("4-fold")
  • The underlying base position in the codon is 6-fold degenerate ("6-fold")
  • The underlying base position in the codon is degenerate at any level ("all")

See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.

JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]

Novelty

Novelty is a three-state stratification:

  • The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
  • The locus does not overlap the knowns comp track ("novel")
  • The locus either overlaps or does not overlap the knowns comp track ("all")

CountVariants

CountVariants is an evaluation module that computes the following metrics:

Metric Definition
nProcessedLoci Number of processed loci
nCalledLoci Number of called loci
nRefLoci Number of reference loci
nVariantLoci Number of variant loci
variantRate Variants per loci rate
variantRatePerBp Number of variants per base
nSNPs Number of snp loci
nInsertions Number of insertion
nDeletions Number of deletions
nComplex Number of complex loci
nNoCalls Number of no calls loci
nHets Number of het loci
nHomRef Number of hom ref loci
nHomVar Number of hom var loci
nSingletons Number of singletons
heterozygosity heterozygosity per locus rate
heterozygosityPerBp heterozygosity per base pair
hetHomRatio heterozygosity to homozygosity ratio
indelRate indel rate (insertion count + deletion count)
indelRatePerBp indel rate per base pair
deletionInsertionRatio deletion to insertion ratio

CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

Metric Definition
nEvalSNPs number of eval SNP sites
nCompSNPs number of comp SNP sites
novelSites number of eval sites outside of comp sites
nVariantsAtComp number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)
compRate percentage of eval sites at comp sites
nConcordant number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)
concordantRate the concordance rate

Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

 $ grep -v '##' dbsnp.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132

Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

 $ grep -v '##' eval_correct_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

 $ grep -v '##' eval_incorrect_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Running VariantEval with just the CompOverlap module:

 $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
        -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
        -L 1:10327 \
        -B:dbsnp,VCF dbsnp.vcf \
        -B:eval_correct_allele,VCF eval_correct_allele.vcf \
        -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
        -noEV \
        -EV CompOverlap \
        -o eval.table

We find that the eval.table file contains the following:

 $ grep -v '##' eval.table | column -t 
 CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
 CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.

TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

Metric Definition
nTi number of transition loci
nTv number of transversion loci
tiTvRatio the transition to transversion ratio
nTiInComp number of comp transition sites
nTvInComp number of comp transversion sites
TiTvRatioStandard the transition to transversion ratio for comp sites

1. What it is and how it helps us improve the GATK

Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.

The information provided by the phone-home feature is critical in driving improvements to the GATK

  • By recording detailed information about each error that occurs, it enables GATK developers to identify and fix previously-unknown bugs in the GATK. We are constantly monitoring the errors our users encounter and do our best to fix those errors that are caused by bugs in our code.
  • It allows us to better understand how the GATK is used in practice and adjust our documentation and development goals for common use cases.
  • It gives us a picture of which versions of the GATK are in use over time, and how successful we've been at encouraging users to migrate from obsolete or broken versions of the GATK to newer, improved versions.
  • It tells us which tools are most commonly used, allowing us to monitor the adoption of newly-released tools and abandonment of outdated tools.
  • It provides us with a sense of the overall size of our user base and the major organizations/institutions using the GATK.

2. What information is sent to us

Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.

A successful run:

<GATK-run-report>
    <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
    <start-time>2012/03/10 20.21.19</start-time>
    <end-time>2012/03/10 20.21.19</end-time>
    <run-time>0</run-time>
    <walker-name>CountReads</walker-name>
    <svn-version>1.4-483-g63ecdb2</svn-version>
    <total-memory>85000192</total-memory>
    <max-memory>129957888</max-memory>
    <user-name>depristo</user-name>
    <host-name>10.0.1.10</host-name>
    <java>Apple Inc.-1.6.0_26</java>
    <machine>Mac OS X-x86_64</machine>
    <iterations>105</iterations>
</GATK-run-report>

A run where an exception has occurred:

<GATK-run-report>
   <id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>   
   <exception>
      <message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
      <stacktrace class="java.util.ArrayList"> 
         <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
         <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
         <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
         <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
         <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
         <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
         <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
      </stacktrace>
      <cause>
         <message>Position: &apos;10,000,001x&apos; contains invalid chars.</message>
         <stacktrace class="java.util.ArrayList">
            <string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
            <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
            <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
            <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
            <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
            <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
            <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
            <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
         </stacktrace>
         <is-user-exception>false</is-user-exception>
      </cause>
      <is-user-exception>true</is-user-exception>
   </exception>
   <start-time>2012/03/10 20.19.52</start-time>
   <end-time>2012/03/10 20.19.52</end-time>
   <run-time>0</run-time>
   <walker-name>CountReads</walker-name>
   <svn-version>1.4-483-g63ecdb2</svn-version>
   <total-memory>85000192</total-memory>
   <max-memory>129957888</max-memory>
   <user-name>depristo</user-name>
   <host-name>10.0.1.10</host-name>
   <java>Apple Inc.-1.6.0_26</java>
   <machine>Mac OS X-x86_64</machine>
   <iterations>0</iterations>
</GATK-run-report>

Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.

3. Disabling Phone Home

The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.

At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.

Examples of legitimate reasons for disabling Phone Home

  • Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.

  • Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.

For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.

How to obtain and use a GATK key

To obtain a GATK key, please fill out the request form.

Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:

java -jar dist/GenomeAnalysisTK.jar \
    -T PrintReads \
    -I public/testdata/exampleBAM.bam \
    -R public/testdata/exampleFASTA.fasta \
    -et NO_ET \
    -K your.key

The -K argument is only necessary when running the GATK with the NO_ET option.

Troubleshooting key-related problems

  • Corrupt/Unreadable/Revoked Keys

If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email '''gsahelp@broadinstitute.org''' to ask for a replacement key.

  • GATK Public Key Not Found

If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.

What does GSA use Phone Home data for?

We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.

A GATKReport is simply a text document that contains well-formatted, easy to read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.

Here's a simple example:

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7                                         
0      7.451835696110506E-3      25.474613284804366                                      
1      2.362777171937477E-3      29.844949954504095                                      
2      9.087604507451836E-4      32.875909752547310
3      5.452562704471102E-4      34.498999090081895                                      
4      9.087604507451836E-4      35.148316651501370                                       
5      5.452562704471102E-4      36.072234352256190                                       
6      5.452562704471102E-4      36.121724890829700                                        
7      5.452562704471102E-4      36.191048034934500                                        
8      5.452562704471102E-4      36.003457059679770                                       

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T 
1:1001  A 
1:1002  C 

This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.

We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are the five simple steps to getting gsalib, installing it and loading a report.

1. Get the GATK source code on GitHub

Please visit the Downloads page for instructions.

2. Compile the gsalib library

$ ant gsalib
Buildfile: build.xml

gsalib:
     [exec] * installing *source* package ?gsalib? ...
     [exec] ** R
     [exec] ** data
     [exec] ** preparing package for lazy loading
     [exec] ** help
     [exec] *** installing help indices
     [exec] ** building package indices ...
     [exec] ** testing if installed package can be loaded
     [exec] 
     [exec] * DONE (gsalib)

BUILD SUCCESSFUL

3. Tell R where to find the gsalib library by adding the path in your ~/.Rprofile (you may need to create this file if it doesn't exist)

$ cat .Rprofile 
.libPaths("/path/to/Sting/R/")

4. Start R and load the gsalib library

$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(gsalib)

5. Finally, load the GATKReport file and have fun

> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list

New WGS and WEx CEU trio BAM files

We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):

  • WEx (150x) sequence
  • WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAMs files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.

Here's the rough library properties of the BAMs:

CEU trio BAM libraries

These data files can be downloaded from the 1000 Genomes DCC

NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

  • NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
  • NA12878.hiseq.wgs.bwa.raw.bam Raw reads (in BAM format, see below)
  • NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
  • NA12878.ga2.exome.maq.raw.bam Raw reads (in BAM format, see below)
  • NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
  • NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
  • BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
  • whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data

Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.

Warnings

Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

  • If you are conducting a review today, we would recommend that the newest version of the GATK, which performs much better than the version described in the paper. Moreover, we would also recommend one use the newest version of Crossbow as well, in case they have improved things. The GATK calls for NA12878 from the paper (above) will give one a good idea what a good call set looks like whole-genome or whole-exome.

  • The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

How do I get the raw FASTQ file from a BAM?

If you want the raw, machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from SAM to FASTQ using the Picard tool SamToFastq.

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue in our support forum you will first need to do a little investigation on your own.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.

Here is a checklist of questions you should ask yourself:

  • How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.

  • What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

  • What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

  • Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

  • Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.

The GATKDocs are what we call "Technical Documentation" in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.

The best place to look for example documentation for a GATK walker is GATKDocsExample walker in org.broadinstitute.sting.gatk.examples. This is available here.

Below is the reproduction of that file from August 11, 2011:

/**
 * [Short one sentence description of this walker]
 *
 * <p>
 * [Functionality of this walker]
 * </p>
 *
 * <h2>Input</h2>
 * <p>
 * [Input description]
 * </p>
 *
 * <h2>Output</h2>
 * <p>
 * [Output description]
 * </p>
 *
 * <h2>Examples</h2>
 * PRE-TAG
 *    java
 *      -jar GenomeAnalysisTK.jar
 *      -T $WalkerName
 * PRE-TAG
 *
 * @category Walker Category
 * @author Your Name
 * @since Date created
 */
public class GATKDocsExample extends RodWalker<Integer, Integer> {
    /**
     * Put detailed documentation about the argument here.  No need to duplicate the summary information
     * in doc annotation field, as that will be added before this text in the documentation page.
     *
     * Notes:
     * <ul>
     *     <li>This field can contain HTML as a normal javadoc</li>
     *     <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
     *     <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
     *          docs here will just result in user posts on the forum</li>
     * </ul>
     */
    @Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
    private boolean myWalkerArgument = false;

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
    public Integer reduceInit() { return 0; }
    public Integer reduce(Integer value, Integer sum) { return value + sum; }
    public void onTraversalDone(Integer result) { }
}

Brief introduction to reference metadata (RMDs)

Note that the -B flag referred to below is deprecated; these docs need to be updated

The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files is that:

  • They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.

  • The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.

  • The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.

Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.

How do I get reference metadata files into my walker?

RMD files are extremely easy to get into the GATK using the -B syntax:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf

In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.

In general, you can provide as many RMD bindings to the GATK as you like:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf

Works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.

1. Directly getting access to a single named track

In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:

public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}

2. Grabbing anything that's convertable to a VariantContext

From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.

Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);

3. Looking at all of the RMDs

Here's a totally general code snippet from PileupWalker.java. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.

private String getReferenceOrderedData( RefMetaDataTracker tracker ) { ArrayList rodStrings = new ArrayList(); for ( GATKFeature datum : tracker.getAllRods() ) { if ( datum != null && ! (datum.getUnderlyingObject() instanceof DbSNPFeature)) { rodStrings.add(((ReferenceOrderedDatum)datum.getUnderlyingObject()).toSimpleString()); // TODO: Aaron: this line still survives, try to remove it } } String rodString = Utils.join(", ", rodStrings);

        DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);

        if ( dbsnp != null)
            rodString += DbSNPHelper.toMediumString(dbsnp);

        if ( !rodString.equals("") )
            rodString = "[ROD: " + rodString + "]";

        return rodString;
}

How do I write my own RMD types?

Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.

Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec class. Name resolution occurs using the -B type parameter, i.e. if the user specified:

-B:calls1,VCF calls1.vcf

The GATK looks for a FeatureCodec called VCFCodec.java to decode the record type. Alternately, if the user specified:

-B:calls1,MYAwesomeFormat calls1.maft

THe GATK would look for a codec called MYAwesomeFormatCodec.java. This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.

Sorry, there are no publicly available documents of this type with the tag #intermediate. Try one of the other types.

1. Notes on known sites

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organisms, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence. Good luck!

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.

2. Recommended sets of known sites per tool

Summary table

Tool dbSNP 129 - - dbSNP >132 - - Mills indels - - 1KG indels - - HapMap - - Omni
RealignerTargetCreator X X
IndelRealigner X X
BaseRecalibrator X X X
(UnifiedGenotyper/ HaplotypeCaller) X
VariantRecalibrator X X X X
VariantEval X

RealignerTargetCreator and IndelRealigner

These tools require known indels passed with the -known argument to function properly. We use both the following files:

  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

BaseRecalibrator

This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:

  • The most recent dbSNP release (build ID > 132)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

  • The most recent dbSNP release (build ID > 132)

VariantRecalibrator

This tool requires known SNPs and indels passed with the -resource argument to function properly. We use all the following files:

  • HapMap genotypes and sites
  • OMNI 2.5 genotypes and sites for 1000 Genomes samples
  • The most recent dbSNP release (build ID > 132)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf

For best results, these resources should be passed with these parameters:

 -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
 -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
 -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
 -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf 

VariantEval

This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:

  • A version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.

As featured in this forum question.

Two main things account for these kinds of differences, both linked to default behaviors of the tools:

1. The tools downsample to different depths of coverage

2. The tools apply different read filters

In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.