On July 23rd, 2012, the Genome Sequencing and Analysis (GSA) team will release a beta of GATK 2.0. GATK 2.0 includes all of the original GATK 1.x tools as well as many newer and more advanced tools for error modeling, data compression, and variant calling:
The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but the timeframe for this is unclear. The GATK engine and programming libraries will remain open-sourced under the MIT license, as they currently are for GATK 1.0. The current GATK 1.0 tool chain, now called GATK-lite, will remain open-source under the MIT license and distributed as a companion binary to the full GATK binary. GATK-lite includes the original base quality score recalibrator (BQSR), indel realigner, unified genotyper v1, and VQSR v2.
The GATK 2.0 tools are under active development, but they have matured to the point that non-Broad academics and researchers are welcome to use them. We appreciate feedback on their use, both successes and failures. Please be aware that the GATK 2.0 tool chain may be unstable, slow, not scalable, or poorly documented, and its tools may not interact seamlessly with each other or with other tools in the suite, so it could require more effort from users. With these caveats, these tools provide radically improved calling sensitivity, specificity, and performance, so they are worth the exposure as beta software.
GATK 2.0 is being released under a software license that permits non-commercial research use only. Until the beta ends and the full GATK 2.0 suite is officially launched, commercial activities should use the unrestricted GATK-lite version.
In the fall we intend to release the full version of GATK 2.0. The full version will be a free-to-use version for non-commercial entities, just like the beta. A commercial license will be required for commercial entities. This commercial version will include commercial-grade support for installation, configuration, and documentation, as well as long-term support for each commercial release.
What does it mean that GATK 2.0 is a beta version? Is it safe to use these tools? Yes, the GATK 2.0 tools are actually quite stable and have been (relatively) widely used by the GSA team to make variation calls for large-scale projects like 1000 Genomes and T2D-GENES. They are beta because they haven’t been used outside of the Broad Institute, and so are likely to have bugs and other usability issues we will need to address over time. Furthermore, they are all evolving rapidly as we improve them and use them in more demanding settings, and so they are expected to change significantly over the next few years as they are perfected, just as happened with the GATK 1.0 suite of tools.
How can I find out best practices for using GATK 2.0 tools? The best place is our newly revised GATK Best Practice v4 guide, which will be finalized on July 23rd here:
Where can I find out more information about the new GATK 2.0 tools? The best place is in our slide archive, where the GSA team has collected many presentations detailing the evolution and analysis of the new 2.0 tools.
How do I download GATK 2.0? From the new GATK website.
How do you support GATK 2.0 tools? Using the new GATK support forum:
Content from the old GetSatisfaction support forum will be transitioned over to the new GATK forums over the next few weeks. These new forums should help users find the answers to their questions much more easily (the new forum is indexed by Google) as well as allow users to more easily follow each other's posts and answer each other's questions.
What specifically is in the new GATK 2.0 beta license? Please see the new GATK2 license terms here, in its temporary home:
When GATK 2.0 is released to github the new terms will be available:
How do you administer the new GATK 2.0 license? Once you register with the new GATK 2.0 forums, you can download the full GATK 2.0 and agree to the new license terms.
Will you continue to support retired tools like the first version of BQSR? No, in the medium-term (next few months) retired tools will no longer be supported. In the short-term, yes, we will continue to respond to support requests on the new forums while people transition to GATK 2.0.
What is GATK-lite? GATK-lite is a subset of the full GATK 2.0 release that is free-to-use for all entities, including commercial ones. It includes all of the capabilities (if not the exact tools) from GATK 1.6 but none of the exclusive 2.0 tools.
For the tech-savvy, GATK-lite is the binary distribution corresponding to the public GATK source released on github. Everything in GATK-lite is licensed under the MIT license.
GATK-Lite includes the entire GATK map/reduce programming framework, all associated GATK libraries, and the vast majority of GATK tools. For example, most of the Unified Genotyper, all of Indel Realigner, CombineVariants, SelectVariants, and VariantEval are all part of GATK-Lite. All of GATK-Queue is also part of GATK-Lite.
The best way to think about GATK (implicitly -full) and GATK-lite is that both GATK and GATK-lite are GATK2, but -lite includes only the open sourced framework, library, and tools. GATK2 is effectively a superset of GATK-lite, built from the same source code, that includes a few closed-source premium tools.
GATK-Lite isn't a dead-end branch of GATK1. All GATK-Lite infrastructure will be fully supported -- to the same degree as GATK1 -- by the GSA team, as we will rely on these tools day-in and day-out. GATK-Lite will evolve in lock-step with the full GATK: GATK-Lite and GATK(-full) will carry the same release numbers and will be pushed out by the GSA group simultaneously. As we add new file formats to the GATK (BCF2, for example) these changes will go into the core of GATK, and be available through both GATK and GATK-Lite.
I’m an academic researcher and I’d like to use GATK 2.0, what do I need to do?
Just download the full GATK jar from our new website, after agreeing to the GATK 2.0 license terms.
I run a genome sequencing center / NGS core facility at an academic institution, can I install GATK 2.0 for my users to run?
Yes, you can upgrade to GATK 2.0, assuming your institution is an academic non-commercial entity. The GATK 2.0 beta license explicitly allows the installation in a pipeline.
I work in a clinical lab at a hospital, and we use GATK 1.0 tools in our diagnostics lab, can I use GATK 2.0?
Yes and no. If the lab engages solely in non-commercial activities (such as clinical research) then yes. If the lab “sells” diagnostic services to health care providers then no, you’ll need to wait until the commercial license is available to upgrade. If the lab does a mix of both, you are welcome to provide a 2.0 pipeline for non-commercial uses while maintaining a 1.0 pipeline for commercial users, until you can upgrade to the commercial license.
I work in a government facility, can I use GATK 2.0?
Yes, the GATK development was and is supported by U.S. federal research grants that entitle government researchers to use the GATK.
I work in a commercial entity (i.e., for-profit company), can I use GATK 2.0?
No, you’ll need to wait until a commercial license version is available later this year. The only exception is answered in the next question.
Even though I work for a commercial entity, I’d like to evaluate GATK 2.0 in a non-production way; is that permitted under the GATK 2.0 beta?
Yes, we view evaluating and exploring the new tools in the GATK 2.0 as “research,” but this requires a special license that can only be obtained from the Broad’s business development office. Please contact Issi Rozen (irozen@) to obtain this commercial evaluation license. Note that this license is only valid until the full commercial release, which will likely have its own trial version for commercial entities.
I built a portal that provides access directly to GATK 2.0 tools, how does the new GATK 2.0 license affect me?
As the GATK 2.0 beta license forbids redistribution of GATK 2.0 tools, you must ensure that these tools are only accessible to users within your institution. You are welcome to install and provide access to tools in GATK-lite, though. GATK-lite contains all of the code -- with a completely non-restrictive MIT license -- available in the latest GATK 1.6 release.
We are actively interested in defining reasonable use terms for third-party pipelines, so please contact Mark DePristo (depristo@) to discuss the matter further.
I work at an academic non-commercial institution, and I built an NGS pipeline that runs GATK tools on sequencing data. We often distribute BAMs and VCFs processed by GATK to our collaborators both within and external to the institution, how does the new GATK 2.0 license affect me?
Very little. The GATK 2.0 license allows academic non-commercial institutions to install, run, and distribute GATK-based results. Commercial institutions, however, are not permitted to use the GATK 2.0 beta, so can only do this using the unrestricted GATK-lite distribution. Note that when the commercial version of the GATK becomes available in late 2012, commercial institutions will have the opportunity to run and distribute the results of GATK 2.0 tools.
I work at an academic institution, and we conduct sponsored research on behalf of commercial entities, how does the new GATK 2.0 license affect me? The GATK 2.0 license explicitly allows an academic non-commercial entity to run GATK 2.0 as part of sponsored research projects.
I make NGS instruments and have embedded GATK in my instrument, how does the new GATK 2.0 license affect me?
The current GATK 2.0 license forbids redistribution of GATK 2.0 binaries, so you will not be able to download GATK 2.0 and redistribute it on your instruments. Of course, you are welcome to redistribute GATK-lite. We envision that instrument manufacturers will be able to purchase a commercial redistribution license when the full commercial GATK version is available in late 2012.
We are actively interested in defining reasonable use terms for instrument manufacturers who want to embed GATK, so please contact Mark DePristo (depristo@) to discuss the matter further.
I compile and distribute an NGS analysis suite that includes the GATK. How does the new GATK 2.0 license affect me?
The new GATK 2.0 license forbids redistribution of the GATK 2.0 tools. You will not be able to include GATK 2.0 tools in your distribution going forward. You are welcome to install and distribute the tools in GATK-lite, though, which is effectively what you have been doing with GATK 1.6. We are actively interested in defining reasonable use terms for third-party redistribution of the premium GATK2 suite, so please contact Mark DePristo (depristo@) to discuss the matter further.
I took the GATK and rewrote parts of the engine or individual tools to make them faster or better. How does the new 2.0 license affect me?
Very little. The GATK map/reduce programming framework and all associated libraries will continue to be available at github under the MIT license. Any improvements you made to the framework will continue to be viable and can be made available to the community in any way you see fit (see below for additional details).
Most of the GATK 1.6 tools remain in the new GATK-lite distribution on github as well, so improvements to any of those tools will remain valuable, and can be redistributed freely as the GATK-lite tools all have an MIT license.
Applying your engine optimizations to the new, protected GATK 2.0 tools can only be explored through a formal collaboration with the GATK team.
I built several NGS tools on top of the GATK, how does the new GATK 2.0 license affect me?
The GATK map/reduce programming framework and all associated libraries will continue to be available at github under the MIT license (i.e., the distribution known as GATK-Lite). We recommend you use and redistribute the GATK framework along with any independently written tools in any way and under any license you choose via the GATK-Lite github distribution mechanism. Several Broad Institute tools from the Cancer Genome Analysis team are distributed in just this way.
See questions about GATK-Lite for more details as well, as this covers the framework and many associated tools in more detail.
What features will the commercial version of the GATK have?
The commercial version of the GATK aims to be a slower evolving, better documented, and better supported version of the one released by the GSA team at the Broad. This means that the commercial version will not contain all of the bleeding edge features of the non-commercial GATK release. Moreover, the commercial version will not contain any significant additional features not available in the non-commercial version.
That said, the commercial version of the GATK will come with vastly better support than the non-commercial version. This includes
Will the commercial version include features not available in the non-commercial version of the GATK? No. See “What will be the relationship between the commercial and GSA released version of the GATK?” below for more information.
Will I be able to purchase the commercial version of the GATK even if I work in a non-commercial entity? Yes, you can. With the commercial version you will gain access to the improved documentation and support.
Why did you decide to restrict the GATK 2.0 beta to non-commercial entities? Because commercial entities will be required to purchase a software license to use the full GATK 2.0 suite.
When will the commercial version be available? In late 2012.
How much will the commercial version of the GATK cost? The specific pricing model has yet to be determined.
What will be the relationship between the commercial and GSA released version of the GATK? The GSA team will continue to release GATK versions on the 6-8 week timeframe, following the standard 2.0, 2.1, 2.2, etc. version convention. These will be available to non-commercial entities. The commercial version will evolve at a slower rate and aggregate many GSA GATK versions into larger commercial releases with much more extensive configuration and use documentation. Unlike the GSA released versions, where the current release is the only supported one, the commercial versions will each be supported for much longer periods of time.
Why did you decide to make parts of the GATK closed source? About a year ago we started to develop the tools that ultimately became part of the binary-only GATK 2.0, including BQSRv2, the advanced UG modules, ReducedReads, and the HaplotypeCaller. From the start these tools were kept private to the master GATK repository, as they were all completely unstable, unusable, and unpublished. As they have evolved into their now usable forms we wanted to share these tools with the community as soon as possible, before any papers, patents, or other forms of intellectual property protection were in place. Releasing binary versions allows us to share our capabilities early while ensuring some IP protection.
Additionally, a closed source model allows us more flexibility in the software licensing terms we enforce with the GATK 2.0. In particular reserving a subset of closed source tools protected by a non-commercial use license allows us to ensure that the research community has access to GATK tools as quickly as possible while preserving the value of a version of GATK licensed to commercial entities.
What GATK source code is available through github? The released GATK source code on github includes the latest GATK programming framework and most GATK tools, including everything from GATK 1.x, as well as associated test files and build scripts. The only material not pushed up to github from the master GATK repository are private tools (not shared in source or binary form) and protected tools (available in binary form only).
Can I get a copy of the source code for the new GATK 2.0 tools like ReducedReads? No, the source code for some of the new GATK 2.0 tools is not being released; these tools are only available in binary form (i.e., as compiled Java bytecode for the JVM).
Are you open to collaboration to obtain access to GATK 2.0 tool source code? Yes, several long-term close collaborators have access to the full GATK repository with public and private libraries. Please contact Mark DePristo (depristo@) to discuss this possibility further.
The GATK makes use of open source libraries, how do you comply with their license restrictions in a closed source GATK? We provide, upon request, a GenomeAnalysisTK.jar file built without pre-packaging any of our dependencies, which can be used to independently link the master GATK jar to any version of our dependencies.
Will you ever make the new GATK 2.0 tools open source? Yes, over time we plan to migrate closed source tools into the open source branch of the GATK.
What source is included with GATK-Lite? GATK-Lite is basically everything in GATK 1.6, including the entire GATK programming framework and all associated libraries. It also includes the vast majority of tools in the GATK -- only a few select, premium analysis tools like the HaplotypeCaller are exclusive to the full version of the GATK. Improvements to the GATK framework, libraries, and lite tools will all continue to be developed and released as part of the GATK-lite distribution.
I am running GATK 2.7.2 with Java 1.7 to calculate the Depth of Coverage using the command -
java -Xmx6100m -XX:ParallelGCThreads=1 -jar /projects/GenomeAnalysisTK-2.7-4-g6f46d11//GenomeAnalysisTK.jar -nct 1 -nt 1 -T DepthOfCoverage -I /scratch/AID1234.Improved.bam -L /projects/nimblv2.EXOME.interval_list -R /projects/fasta/Human_GRC_build37_1kGproject/human_g1k_v37_Ensembl_MT_66.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /scratch/AID1234.txt
But when I check the details of the job with top command, I can see that a total of 10 cores are being used instead of 1.
Can someone tell me what I am doing wrong and how to overcome the problem?
If I have a bam file with three different read groups, and use SplitSamFile to split it like so:
java -Xmx2g -jar $GATKJAR -T SplitSamFile -I $INBAM -R $GENOME --outputRoot $PROJD/$IND/
Each of the output bam files has all three read groups in its header. Is that the intended behavior? I would like each file to have only its own read group info in the header. Sorry for the bash variables in the command above; they make it readable at least.
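If the goal is only to trim the header of each split file, one possible workaround (a sketch, not a SplitSamFile option) is to rewrite the header text so that it keeps a single @RG line. It relies only on the SAM convention that header lines are tab-separated and that each @RG line carries an ID: tag. The function name, the read group IDs, and the example header are made up for illustration; the trimmed header would still have to be applied to the BAM with a separate tool, for example samtools reheader.

```python
def keep_read_group(header_text, rg_id):
    """Return SAM header text with every @RG line dropped except the one whose ID is rg_id."""
    kept = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            fields = line.split("\t")[1:]
            tags = dict(field.split(":", 1) for field in fields if ":" in field)
            if tags.get("ID") != rg_id:
                continue  # skip read groups that belong to the other output files
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical header with three read groups, standing in for the real one.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@RG\tID:rg1\tSM:sampleA\n"
    "@RG\tID:rg2\tSM:sampleB\n"
    "@RG\tID:rg3\tSM:sampleC\n"
)

trimmed_header = keep_read_group(example_header, "rg2")
print(trimmed_header, end="")
```

The reads themselves are untouched by this; it only changes which read groups the header declares.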
We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.
We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:
scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.
In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous. Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.
A reverse example:
scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.
Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.
I've also uploaded screenshots of IGV that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not use VQSR, as we did not have a list of high confidence SNPs at the time.
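To make the comparison between the two quoted records easier to follow, here is a small sketch that recomputes the per-individual alternate-allele fraction from the AD field and reports which genotype has the lowest PL value. It assumes biallelic records with the FORMAT layout GT:AD:DP:GQ:PL shown above, and the usual VCF convention that PL lists Phred-scaled likelihoods in the order 0/0, 0/1, 1/1 with the most likely genotype at 0. The function and variable names are illustrative only.

```python
GENOTYPE_ORDER = ("0/0", "0/1", "1/1")  # PL order for a biallelic site


def summarize_genotypes(vcf_line):
    """Summarize each sample column of a biallelic VCF line with FORMAT GT:AD:DP:GQ:PL."""
    columns = vcf_line.split()
    format_keys = columns[8].split(":")
    summaries = []
    for sample_field in columns[9:]:
        if sample_field.startswith("./."):
            summaries.append(None)  # no call for this individual
            continue
        values = dict(zip(format_keys, sample_field.split(":")))
        ref_reads, alt_reads = (int(x) for x in values["AD"].split(","))
        likelihoods = [int(x) for x in values["PL"].split(",")]
        total = ref_reads + alt_reads
        summaries.append({
            "called": values["GT"],
            "alt_reads": alt_reads,
            "alt_fraction": alt_reads / total if total else 0.0,
            "best_by_pl": GENOTYPE_ORDER[likelihoods.index(min(likelihoods))],
        })
    return summaries


# The two records quoted in this post, copied as whitespace-separated text.
record_1 = (
    "scaffold00001 1199020 . T G 44.35 . "
    "AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;"
    "HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;"
    "QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 "
    "GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 "
    "0/0:39,18:57:65:0,65,1426 ./."
)
record_2 = (
    "scaffold00458 298207 . A G 64.81 . "
    "AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;"
    "HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;"
    "QD=1.16;ReadPosRankSum=-3.252;SB=0.439 "
    "GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 "
    "0/1:31,4:35:90:90,0,102 ./."
)

for record in (record_1, record_2):
    for number, summary in enumerate(summarize_genotypes(record), start=1):
        if summary is None:
            print(f"individual {number}: no call")
        else:
            print(
                f"individual {number}: GT={summary['called']} "
                f"alt reads={summary['alt_reads']} ({summary['alt_fraction']:.1%}) "
                f"lowest PL at {summary['best_by_pl']}"
            )
```

In both records the reported GT matches the genotype with the lowest PL, so the calls follow the likelihoods rather than the raw read fractions. That fits the observation about low-quality G bases, but it does not by itself say whether the likelihoods are reasonable for this genome.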
I have been using GATK (v2.2) UnifiedGenotyper to generate VCFs. I did a multisample realignment around indels which generated a multisample BAM of size ~500Gb. After looking at some of the SNP calls I decided to try removing duplicates. I used MarkDuplicates with "REMOVE_DUPLICATES=true" and although only 10% of reads were duplicates, the BAM reduced to ~75Gb. This did not seem to affect the depth of reads at a site more than the expected ~10%, but now the AD field in the genotype columns is missing, i.e., GT:AD:GQ 0/1:.:30. When I run UnifiedGenotyper with the old BAM prior to MarkDuplicates the AD field is present.
I am currently running MarkDuplicates on each sample prior to realignment, because I think this makes the most sense, but it isn't clear why this should matter.
Any ideas on what is happening here?
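For checking how widespread the missing AD values are, a minimal sketch along these lines could be run over the genotype columns of each record. It only inspects the FORMAT string and the sample fields; the helper name is hypothetical, and the second sample value in the example is invented to contrast with the 0/1:.:30 case quoted above.

```python
def samples_missing_ad(format_field, sample_fields):
    """Return the indices of samples whose AD value is absent or '.'."""
    keys = format_field.split(":")
    if "AD" not in keys:
        # AD was not emitted at all for this record.
        return list(range(len(sample_fields)))
    ad_index = keys.index("AD")
    missing = []
    for index, sample in enumerate(sample_fields):
        values = sample.split(":")
        # Trailing fields may be dropped, so guard against short sample columns.
        if ad_index >= len(values) or values[ad_index] in (".", ""):
            missing.append(index)
    return missing


# First sample is the case from this post; the second is a made-up contrast.
missing_indices = samples_missing_ad("GT:AD:GQ", ["0/1:.:30", "0/1:12,9:45"])
print(missing_indices)
```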
We are keen to know more regarding the licensing arrangements for GATK 2.0, specifically availability and cost. I understand not all arrangements are yet in place, however we are eager to receive any updated information you may have. Please may I request contact information so that we may continue this discussion via telephone.
Thank you for your assistance. Kind regards,
PopulationGenetics Cambridge UK www.populationgenetics.com
I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked, as I have a recalibrated BAM file and the accompanying GATKReport, but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).
Can you please help? Thanks
This discussion was created from comments split from: GATK 2.0 announcement.
This was done because the new licensing and mixed open source model has turned out to be the object of much debate. We want to encourage discussion on this topic without obscuring the GATK 2.0 announcement thread, which is dedicated more so to the GATK 2.0 software itself, particularly the new tools and improvements.
So, have at it!