Tagged with #faq
8 documentation articles | 1 announcement | 0 forum discussions


Comments (1)

Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here.


You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).

But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?

To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.

First, you need to understand what the two core components of the GATK are: the engine and the tools (see picture below).

As explained here, the engine handles all the common work related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK were a car, that would be the engine and chassis. What we call the *tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, those would be the headlights, airbags and so on.

Core GATK components

Second is how we work on developing the GATK, and what it means for how improvements are shared (or not) between Lite and Full.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. We do not maintain different versions in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.

The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements end up in both GATK-Lite and GATK-Full. So in terms of the power, speed and robustness that are determined by the engine, there is no difference between them.

For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.

So there are two basic types of differences between the tools available in the Lite and Full builds (see picture below).

  1. We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.

  2. We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Tools in Lite vs. Full

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:

  1. The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.

  2. Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.

So, to summarize:

  • The underlying engine is exactly the same in both GATK-Lite and GATK-Full.
  • Most functionalities are available in both builds, performed by the same tools.
  • Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better.
  • New, cutting-edge functionalities are only available in the Full build, with no equivalent in the Lite build.

We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!

Comments (2)

1. What file formats do you support for interval lists?

We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
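
For reference, a Picard-style interval list consists of a SAM-style header containing the sequence dictionary, followed by one interval per line with five tab-separated columns: contig, start, stop, strand and interval name. Here is a minimal illustrative example (the contig lengths and coordinates are made up for illustration):

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621
@SQ     SN:2    LN:243199373
1       1000001 1000500 +       target_1
2       2000001 2000750 +       target_2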

2. I have two (or more) sequencing experiments with different target intervals. How can I combine them?

One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).

Comments (0)

1. What file formats do you support for variant callsets?

We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.

2. How can I know if my VCF file is valid?

VCFTools contains a validation tool that will allow you to verify it.
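
For example, if you have the VCFtools Perl utilities installed, a typical validation run looks something like the following (a sketch only; the tool name and options may vary with your VCFtools version):

$ vcf-validator my.vcf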

3. Are you planning to include any converters from different formats or allow different input formats than VCF?

No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.

Comments (29)

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
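
As a rough illustration only (not an official protocol, and assuming reasonably recent versions of bwa and samtools; the file names and read group string are placeholders), a paired-end alignment producing a sorted, indexed BAM might look like this:

$ bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tSM:SAMPLE1\tLB:LIB1\tPL:ILLUMINA' \
      reference.fasta reads_1.fastq reads_2.fastq | samtools sort -o aligned.sorted.bam -
$ samtools index aligned.sorted.bam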

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file's SQ order match another reference sequence.
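
For example, with a recent Picard distribution a coordinate sort might look like the following (a sketch; the file names are placeholders, and older Picard releases ship per-tool jars such as SortSam.jar instead of the combined picard.jar):

java -jar picard.jar SortSam \
   INPUT=unsorted.bam \
   OUTPUT=sorted.bam \
   SORT_ORDER=coordinate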

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of @RG tags indicates the presence of read groups. Each read group has an SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

The key read group tags are described below, with their importance, their definition in the SAM spec, and their meaning in the GATK.

ID
  Importance: Required.
  SAM spec definition: Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in the header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
  Meaning: Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. It will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.

SM
  Importance: Required. As important as ID.
  SAM spec definition: Sample. Use pool name where a pool is being sequenced.
  Meaning: The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.

PL
  Importance: Important. Not currently used in the GATK, but was in the past, and may return.
  SAM spec definition: Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
  Meaning: The only way to know the sequencing technology used to generate the sequencing data. It's a good idea to use this field.

LB
  Importance: Essential for MarkDuplicates.
  SAM spec definition: DNA preparation library identifier.
  Meaning: MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require values for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship of read groups (unique for each lane) to libraries (each sequenced on two lanes) and samples (each spread across four lanes, two lanes per library).

10. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.
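
A sketch of what that command might look like with a recent Picard distribution (all read group values and file names here are placeholders you should replace with your own):

java -jar picard.jar AddOrReplaceReadGroups \
   INPUT=reads.bam \
   OUTPUT=reads.rg.bam \
   RGID=FLOWCELL1.LANE1 \
   RGLB=LIB-SAMPLE-1 \
   RGPL=ILLUMINA \
   RGPU=FLOWCELL1.LANE1 \
   RGSM=SAMPLE1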

11. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.
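
For example (a sketch, assuming a recent Picard distribution; older releases use a standalone ValidateSamFile.jar):

java -jar picard.jar ValidateSamFile \
   INPUT=my.bam \
   MODE=SUMMARY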

12. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -T PrintReads -R /path/to/reference.fasta -I full.bam -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping that interval. This operation retains the complete BAM header from the full file (including the sequence dictionary of the reference it was aligned to) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Comments (1)

1. What is Scala?

Scala is a combination of an object-oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.

The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there are a multitude of resources available around the web, including the Scala home page and the online Scala Doc.

2. Where do I learn more about Scala?

  • http://www.scala-lang.org
  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://devcheatsheet.com/tag/scala/
  • http://davetron5000.github.com/scala-style/index.html

3. What is the difference between var and val?

var declares a variable whose value you can later modify, while val declares a value that cannot be reassigned, similar to final in Java.

4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?

Because the GATK and Queue are a mix of Scala and Java, sometimes you'll run into problems when you need a Scala collection but a Java collection is returned instead.

   MyQScript.scala:39: error: type mismatch;
     found   : java.util.List[java.lang.String]
     required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)

Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.

import collection.JavaConversions._

Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.

var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)

5. How do I append to a list?

Use the :+ operator for a single value.

  var myList = List.empty[String]
  myList :+= "a"
  myList :+= "b"
  myList :+= "c"

Use ++ for appending a list.

  var myList = List.empty[String]
  myList ++= List("a", "b", "c")

6. How do I add to a set?

Use the + operator.

  var mySet = Set.empty[String]
  mySet += "a"
  mySet += "b"
  mySet += "c"

7. How do I add to a map?

Use the + and -> operators.

  var myMap = Map.empty[String,Int]
  myMap += "a" -> 1
  myMap += "b" -> 2
  myMap += "c" -> 3

8. What are Option, Some, and None?

Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.

  var myNullableInt1: Option[Int] = Some(1)
  var myNullableInt2: Option[Int] = None

9. What is _ / What is the underscore?

François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded

To quote from his slides:

Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it

10. How do I format a String?

Use the .format() method.

This Java snippet:

String formatted = String.format("%s %d", myString, myInt);

In Scala would be:

val formatted = "%s %d".format(myString, myInt)

11. Can I use Scala Enumerations as QScript @Arguments?

No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if, for example, you are importing a Java-based walker's enum type.

If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.

Comments (3)

This document describes the resource datasets and arguments to use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90.00%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • Known and true sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

Arguments for VariantRecalibrator command:

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff \
   -mode SNP \

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, then VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.
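
If you do need to (re)annotate your callset first, a VariantAnnotator command along these lines can add the annotations used above (a sketch only; the reference, file names and exact set of -A annotation modules are assumptions you should adapt to your own data):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantAnnotator \
   -R path/to/reference/human_g1k_v37.fasta \
   -I original.bam \
   --variant raw.input.vcf \
   -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest \
   -A ReadPosRankSumTest -A FisherStrand -A Coverage -A InbreedingCoeff \
   -o annotated.input.vcf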

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.
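
Putting the base command line and the SNP-specific arguments together, a complete VariantRecalibrator invocation might look like the following (a sketch; the paths and resource file names are placeholders based on the resource bundle and should be adapted to your setup):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff \
   -mode SNP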

Important notes about annotations

Some of these annotations might not be the best for your particular dataset.

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Additionally, the UnifiedGenotyper produces a statistic called HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because the underlying mathematics is already built into its likelihood function, which operates on full haplotypes.

The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples please omit this annotation from the command line.

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller variant callset but experiment with the precise argument settings (try adding --maxGaussians 4 to your command line, for example)

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

Arguments for VariantRecalibrator:

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

As with the SNP recommendations above, these recommendations no longer include the --numBadVariants argument; the VariantRecalibrator now determines internally, based on the data, how many variants to use for modeling "bad" variants.

ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we use HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \
Comments (53)

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Important caveats

Sensitivity to case and type

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
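
For example, rather than chaining everything into one long expression, you can give VariantFiltration several simple expressions, each with its own filter name (a sketch; the thresholds and file names here are purely illustrative):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantFiltration \
   -R b37/human_g1k_v37.fasta \
   --variant my.vcf \
   --filterExpression "QD < 2.0" --filterName "LowQD" \
   --filterExpression "FS > 60.0" --filterName "HighFS" \
   -o my.filtered.vcf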

5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Accessing the underlying VariantContext directly

If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set with an allele frequency greater than 0.25 but less than 1, that are not filtered, and that are non-reference in sample 01-0263:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

6. Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
Comments (10)

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihood calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants, including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
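
One quick way to look at the raw pileup is with samtools (a sketch; substitute your own reference, BAM and coordinates):

$ samtools mpileup -f reference.fasta -r 1:1234567-1234567 my.bam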

Here is a checklist of questions you should ask yourself:

  • How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.

  • What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

  • What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

  • Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

  • Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site? If so, you are probably seeing false positives.

  • Specifically for Haplotype Caller

In addition to the reasons above, there is another reason why the Haplotype Caller may not call a variant that seems obvious in the original bam file.

Haplotype Caller performs a local reassembly of the reads in the active region. Please refer here for more details: http://www.broadinstitute.org/gatk/guide/article?id=4148

This reassembly is important, because when mapping reads to the whole genome, it is easy to make an error. When reassembling reads in a much smaller region, such as the active region, the alignment is more likely to be accurate.

The reads you see in the alignment of the original bam file may suggest a variant should be called. However, due to the realignment, the reads may no longer support the variant. In order to see the new alignment of reads, you can use the -bamout argument. You can then compare the aligned reads from the original bam file to the newly aligned reads in the -bamout file.
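
For example, a command along these lines writes out the reassembled reads around a region of interest (a sketch; the reference, interval and file names are placeholders):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R reference.fasta \
   -I original.bam \
   -L 1:1234000-1235000 \
   -bamout reassembled_reads.bam \
   -o calls.vcf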

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

Comments (1)

GATK 2.0

On July 23rd, 2012, the Genome Sequencing and Analysis (GSA) team will release a beta of GATK 2.0. GATK 2.0 includes all of the original GATK 1.x tools as well as many newer and more advanced tools for error modeling, data compression, and variant calling:
  • Base quality score recalibration (BQSR) v2, an upgrade to BQSR that generates a base substitution, insertion, and deletion error model.
  • ReduceReads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
  • The HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
  • Powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.

GATK 2.0 is moving to a mixed open/closed-source model

The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but the timeframe for this is unclear. The GATK engine and programming libraries will remain open-sourced under the MIT license, as they currently are for GATK 1.0. The current GATK 1.0 tool chain, now called GATK-lite, will remain open-source under the MIT license and distributed as a companion binary to the full GATK binary. GATK-lite includes the original base quality score recalibrator (BQSR), indel realigner, unified genotyper v1, and VQSR v2.

GATK 2.0 beta

The GATK 2.0 tools are under active development but they have matured to the point that non-Broad academics and researchers are welcome to use them. We appreciate feedback on their use, both successes and failures. Please be aware that the GATK 2.0 tool chain may be unstable, slow, not scalable, or poorly documented, and the tools may not interact seamlessly with each other or with other tools in the suite, so they could require more effort from users. With these caveats, these tools provide radically improved calling sensitivity, specificity, and performance, so they are worth the exposure as beta software.

GATK licensing during and after the beta

GATK 2.0 is being released under a software license that permits non-commercial research use only. Until the beta ends and the full GATK 2.0 suite is officially launched, commercial activities should use the unrestricted GATK-lite version. In the fall we intend to release the full version of GATK 2.0. The full version will be free to use for non-commercial entities, just like the beta. A commercial license will be required for commercial entities. This commercial version will include commercial-grade support for installation, configuration, and documentation, as well as long-term support for each commercial release.

Frequently Asked Questions (FAQ)

Basic

What does it mean that GATK 2.0 is a beta version? Is it safe to use these tools?

Yes, the GATK 2.0 tools are actually quite stable and have been (relatively) widely used by the GSA team to make variation calls for large-scale projects like 1000 Genomes and T2D-GENES. They are beta because they haven't been used outside of the Broad Institute, and so are likely to have bugs and other usability issues we will need to address over time. Furthermore, they are all evolving rapidly as we improve them and use them in more demanding settings, and so they are expected to change significantly over the next few years as they are perfected, just as happened with the GATK 1.0 suite of tools.

How can I find out best practices for using GATK 2.0 tools?

The best place is our newly revised GATK Best Practice v4 guide, which will be finalized on July 23rd here: http://gatk.vanillaforums.com/discussion/18/experimental-best-practice-variant-detection-with-the-gatk-v4-for-release-2-0

Where can I find out more information about the new GATK 2.0 tools?

The best place is our slide archive, where the GSA team has collected many presentations detailing the evolution and analysis of the new 2.0 tools: https://www.dropbox.com/sh/e31kvbg5v63s51t/6GdimgsKss

How do I download GATK 2.0?

From the new GATK website.

How do you support GATK 2.0 tools?

Using the new GATK support forum: http://gatk.vanillaforums.com/. Content from the old GetSatisfaction support forum will be transitioned over to the new GATK forums over the next few weeks. These new forums should help users find the answers to their questions much more easily (the new forum is indexed by Google) as well as allow users to more easily follow each other's posts and answer each other's questions.

What specifically is in the new GATK 2.0 beta license?

Please see the new GATK2 license terms here, in their temporary home: https://www.dropbox.com/s/giru0lqihudhy0s/GATK2_beta_license.pdf. When GATK 2.0 is released to github, the new terms will be available at: https://github.com/broadgsa/gatk/blob/master/licensing/GATK2_beta_license.pdf

How do you administer the new GATK 2.0 license?

Once you register with the new GATK 2.0 forums, you can download the full GATK 2.0 and agree to the new license terms.

Will you continue to support retired tools like the first version of BQSR?

No, in the medium term (next few months) retired tools will no longer be supported. In the short term, yes, we will continue to respond to support requests on the new forums while people transition to GATK 2.0.

What is GATK-lite?

GATK-lite is a subset of the full GATK 2.0 release that is free-to-use for all entities, including commercial ones. It includes all of the capabilities (if not the exact tools) from GATK 1.6 but none of the exclusive 2.0 tools. For the tech-savvy, GATK-lite is the binary distribution corresponding to the public GATK source released on github. Everything in GATK-lite is licensed under the MIT license. GATK-Lite includes the entire GATK map/reduce programming framework, all associated GATK libraries, and the vast majority of GATK tools. For example, most of the Unified Genotyper, all of Indel Realigner, CombineVariants, SelectVariants, and VariantEval are part of GATK-Lite. All of GATK-Queue is also part of GATK-Lite. The best way to think about GATK (implicitly -full) and GATK-lite is that both are GATK2, but -lite includes only the open-sourced framework, libraries, and tools. GATK2 is effectively a superset of GATK-lite, built from the same source code, that includes a few closed-source premium tools. GATK-Lite isn't a dead-end branch of GATK1. All GATK-Lite infrastructure will be fully supported -- to the same degree as GATK1 -- by the GSA team, as we will rely on these tools day-in and day-out. GATK-Lite will evolve in lock-step with the full GATK: GATK-Lite and GATK(-full) will carry the same release numbers, and will be pushed out by the GSA group simultaneously. As we add new file formats to the GATK (BCF2, for example) these changes will go into the core of GATK, and be available through both GATK and GATK-Lite.

Licensing

I'm an academic researcher and I'd like to use GATK 2.0, what do I need to do?

Just download the full GATK jar from our new website, after agreeing to the GATK 2.0 license terms.

I run a genome sequencing center / NGS core facility at an academic institution, can I install GATK 2.0 for my users to run?

Yes, you can upgrade to GATK 2.0, assuming your institution is an academic non-commercial entity. The GATK 2.0 beta license explicitly allows the installation in a pipeline.

I work in a clinical lab at a hospital, and we use GATK 1.0 tools in our diagnostics lab, can I use GATK 2.0?

Yes and no. If the lab engages solely in non-commercial activities (such as clinical research) then yes. If the lab "sells" diagnostic services to health care providers then no, you'll need to wait until the commercial license is available to upgrade. If the lab does a mix of both, you are welcome to provide a 2.0 pipeline for non-commercial uses while maintaining a 1.0 pipeline for commercial users, until you can upgrade to the commercial license.

I work in a government facility, can I use GATK 2.0?

Yes, the GATK development was and is supported by U.S. federal research grants that entitle government researchers to use the GATK.

I work in a commercial entity (i.e., a for-profit company), can I use GATK 2.0?

No, you'll need to wait until a commercial license version is available later this year. The only exception is answered in the next question.

Even though I work for a commercial entity, I'd like to evaluate GATK 2.0 in a non-production way; is that permitted under the GATK 2.0 beta?

Yes, we view evaluating and exploring the new tools in the GATK 2.0 as "research," but this requires a special license that can only be obtained from the Broad's business development office. Please contact Issi Rozen (irozen@) to obtain this commercial evaluation license. Note that this license is only valid until the full commercial release, which will likely have its own trial version for commercial entities.

I built a portal that provides access directly to GATK 2.0 tools, how does the new GATK 2.0 license affect me?

As the GATK 2.0 beta license forbids redistribution of GATK 2.0 tools, you must ensure that these tools are only accessible to users within your institution. You are welcome to install and provide access to tools in GATK-lite, though. GATK-lite contains all of the code -- with a completely non-restrictive MIT license -- available in the latest GATK 1.6 release. We are actively interested in defining reasonable use terms for third-party pipelines, so please contact Mark DePristo (depristo@) to discuss the matter further.

I work at an academic non-commercial institution, and I built an NGS pipeline that runs GATK tools on sequencing data. We often distribute BAMs and VCFs processed by GATK to our collaborators both within and external to the institution, how does the new GATK 2.0 license affect me?

Very little. The GATK 2.0 license allows academic non-commercial institutions to install, run, and distribute GATK-based results. Commercial institutions, however, are not permitted to use the GATK 2.0 beta, so they can only do this using the unrestricted GATK-lite distribution. Note that when the commercial version of the GATK becomes available in late 2012, commercial institutions will have the opportunity to run and distribute the results of GATK 2.0 tools.

I work at an academic institution, and we conduct sponsored research on behalf of commercial entities, how does the new GATK 2.0 license affect me?

The GATK 2.0 license explicitly allows an academic non-commercial entity to run GATK 2.0 as part of sponsored research projects.

I make NGS instruments and have embedded GATK in my instrument, how does the new GATK 2.0 license affect me?

The current GATK 2.0 license forbids redistribution of GATK 2.0 binaries, so you will not be able to download GATK 2.0 and redistribute it on your instruments. Of course, you are welcome to redistribute GATK-lite. We envision that instrument manufacturers will be able to purchase a commercial, redistribution license when the full commercial GATK version is available in late 2012. We are actively interested in defining reasonable use terms for instrument manufacturers who want to embed GATK, so please contact Mark DePristo (depristo@) to discuss the matter further.

I compile and distribute an NGS analysis suite that includes the GATK. How does the new GATK 2.0 license affect me?

The new GATK 2.0 license forbids redistribution of the GATK 2.0 tools. You will not be able to include GATK 2.0 tools in your distribution going forward. You are welcome to install and distribute the GATK-lite tools, though, which is effectively what you have been doing with GATK 1.6. We are actively interested in defining reasonable use terms for third-party redistribution of the premium GATK2 suite, so please contact Mark DePristo (depristo@) to discuss the matter further.

I took the GATK and rewrote parts of the engine or individual tools to make them faster or better. How does the new 2.0 license affect me?

Very little. The GATK map/reduce programming framework and all associated libraries will continue to be available at github under the MIT license. Any improvements you made to the framework will continue to be viable and can be made available to the community in any way you see fit (see below for additional details). Most of the GATK 1.6 tools remain in the new GATK-lite distribution on github as well, so improvements to any of those tools will remain valuable, and can be redistributed freely as the GATK-lite tools all have an MIT license. Applying your engine optimizations to the new, protected GATK 2.0 tools can only be explored through a formal collaboration with the GATK team.

I built several NGS tools on top of the GATK, how does the new GATK 2.0 license affect me?

The GATK map/reduce programming framework and all associated libraries will continue to be available at github under the MIT license (i.e., the distribution known as GATK-Lite). We recommend you use and redistribute the GATK framework along with any independently written tools in any way and under any license you choose via the GATK-Lite github distribution mechanism. Several Broad Institute tools from the Cancer Genome Analysis team are distributed in just this way. See the questions about GATK-Lite for more details as well, as this covers the framework and many associated tools in more detail.

Commercial GATK

What features will the commercial version of the GATK have?

The commercial version of the GATK aims to be a slower-evolving, better-documented, and better-supported version of the one released by the GSA team at the Broad. This means that the commercial version will not contain all of the bleeding-edge features of the non-commercial GATK release. Moreover, the commercial version will not contain any significant additional features not available in the non-commercial version. That said, the commercial version of the GATK will come with vastly better support than the non-commercial version. This includes:
  • More extensive installation and use documentation.
  • “consulting” services to help configure and optimize GATK for a particular NGS sequencing process
  • A phone number to call when something is broken along with a service-level agreement that guarantees a response to the problem within a reasonable timeframe.
  • Long-term support for each released version of the commercial GATK, a particularly important feature for users interested in using the GATK in certified processes, such as CLIA.
Will the commercial version include features not available in the non-commercial version of the GATK?

No. See "What will be the relationship between the commercial and GSA released version of the GATK?" below for more information.

Will I be able to purchase the commercial version of the GATK even if I work in a non-commercial entity?

Yes, you can. With the commercial version you will gain access to the improved documentation and support.

Why did you decide to restrict the GATK 2.0 beta to non-commercial entities?

Because commercial entities will be required to purchase a software license to use the full GATK 2.0 suite.

When will the commercial version be available?

In late 2012.

How much will the commercial version of the GATK cost?

The specific pricing model has yet to be determined.

What will be the relationship between the commercial and GSA released version of the GATK?

The GSA team will continue to release GATK versions on the 6-8 week timeframe, following the standard 2.0, 2.1, 2.2, etc. version convention. These will be available to non-commercial entities. The commercial version will evolve at a slower rate and aggregate many GSA GATK versions into larger commercial releases with much more extensive configuration and use documentation. Unlike the GSA released versions, where the current release is the only supported one, the commercial versions will each be supported for much longer periods of time.

Source code

Why did you decide to make parts of the GATK closed source?

About a year ago we started to develop the tools that ultimately became part of the binary-only GATK 2.0, including BQSRv2, the advanced UG modules, ReducedReads, and the HaplotypeCaller. From the start these tools were kept private to the master GATK repository, as they were all completely unstable, unusable, and unpublished. As they have evolved into their now usable forms we wanted to share these tools with the community as soon as possible, before any papers, patents, or other forms of intellectual property protection were in place. Releasing binary versions allows us to share our capabilities early while ensuring some IP protection. Additionally, a closed-source model allows us more flexibility in the software licensing terms we enforce with the GATK 2.0. In particular, reserving a subset of closed-source tools protected by a non-commercial use license allows us to ensure that the research community has access to GATK tools as quickly as possible while preserving the value of a version of GATK licensed to commercial entities.

What GATK source code is available through github?

The released GATK source code on github includes the latest GATK programming framework and most GATK tools, including everything from GATK 1.x, as well as associated test files and build scripts. The only material not pushed up to github from the master GATK repository are private tools (not shared in source or binary form) and protected tools (available in binary form only).

Can I get a copy of the source code for the new GATK 2.0 tools like ReducedReads?

No, the source code for some of the new GATK 2.0 tools is not being released; those tools are only available in binary form (i.e., as compiled java JVM instructions).

Are you open to collaboration to obtain access to GATK 2.0 tool source code?

Yes, several long-term close collaborators have access to the full GATK repository with public and private libraries. Please contact Mark DePristo (depristo@) to discuss this possibility further.

The GATK makes use of open source libraries, how do you comply with their license restrictions in a closed source GATK?

We provide, upon request, a GenomeAnalysisTK.jar file built without pre-packaging any of our dependencies, which can be used to independently link the master GATK jar to any version of our dependencies.

Will you ever make the new GATK 2.0 tools open source?

Yes, over time we plan to migrate closed-source tools into the open-source branch of the GATK.

What source is included with GATK-Lite?

GATK-Lite is basically everything in GATK 1.6, including the entire GATK programming framework and all associated libraries. It also includes the vast majority of tools in the GATK -- only a few select, premium analysis tools like the HaplotypeCaller are only in the full version of the GATK. Improvements to the GATK framework, libraries, and Lite tools will all continue to be developed and released as part of the GATK-lite distribution.