Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here.
You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).
But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?
To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.
As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK were a car, that would be the engine and chassis. What we call the **tools** are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.
We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.
The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements end up in both GATK-Lite and GATK-Full. So wherever the power, speed and robustness of the GATK are determined by the engine, there is no difference between them.
For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.
In practice, this leads to two kinds of differences between the tools in the Lite and Full builds:
- We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.
- We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.
Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:
- The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.
- Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.
The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.
We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!
If you never got the point of GATK Lite and you hated the 2.0 license... Oh, do we have good news for you!
First, a little bit of context. When we released GATK 2.0, the GATK had emerged as the leading research software package in its domain. Public demand for tech support was rising rapidly; not only from the academic research community as it had in the past, but also from researchers using the software in a for-profit context. These latter users have specific needs (quality assurance, process certifications, etc.) that we are ill-equipped to address.
This drove us to seek a partnership with a company called Appistry which could release and license the GATK as a commercial software product appropriate for use in a for-profit and regulatory-compliant setting. We knew this solution would better meet customer needs, while alleviating our support burden and allowing us to focus on our core constituency, the academic and non-profit research community. This plan also had the prospective benefit of leveraging the intellectual property of the GATK (much of which results more or less directly from public investments) to fund the continuation of our research and development activities.
However, we knew it would take us and our partners at Appistry some time to develop a mature commercial product. So as an interim solution, we enacted a more restrictive license, closed part of the source code on the “Full GATK” release, and provided a “Lite” version to enable for-profit users to keep working with an up-to-date version of the GATK (albeit without the cutting-edge tools that were introduced in version 2.0). Of course, the GATK programming framework (the GATK engine, libraries, and basic data management tools) remained open source under the MIT license.
Well, we got a lot of feedback from the user community over these changes. We listened carefully, took the criticism to heart, and realized our interim solution left much to be desired. First, closing part of the source code was a deeply unpopular move. Many of you pointed out that this might restrict academic knowledge and obstruct progress in the field of algorithmic research. Second, we did a poor job of communicating the purpose of Lite and how it differed from the Full version. Even though Lite was always intended as an interim solution, some organizations opted to adopt it instead of the Full version and seem to view it as a viable long-term solution for genetic analysis. Related to this, we found that maintaining the two different distributions gave us our share of headaches in terms of supporting and updating the toolkit.
In light of these considerations, we’re going to change things up again, hopefully for the better!
In a nutshell: no more Lite and a new license (attached) that restores free access to the source code for those in the community performing academic non-commercial research. That’s right, free as in beer! You’ll still have the option of downloading the packaged binary (i.e., the “ready-to-run” program) from our website as you did before, but you’ll also be able to get the full source code (programming framework AND all tools including the latest and greatest) straight from the Github repository if you want. You can set it up on a server and provide it as a service to other non-profit users within your organization. You can dig into our deepest secrets to find out what makes ReduceReads and the HaplotypeCaller tick. And feel free to send us patches if you find a way to improve the code!
Licensed users through Appistry, in addition to having access to the full GATK and the added benefits of a fully-fledged commercial solution (less buggy, more help-y), may optionally purchase access to the source code. Appistry has been fine-tuning its process for providing the commercial product (including enterprise-grade QA, which we don’t do) as well as training a professional support team. If your use of GATK requires a commercial license, we encourage you to reach out to them. Appistry will be able to handle any questions you may have about the commercial release schedule, available support, and of course, licensing and pricing terms (whether for individual or site-wide licenses, companies big and small).
The following figure summarizes the different packages and their corresponding licenses.
Note that if you are using a version of GATK-Lite, you may continue using it, but we will be making no more updates to Lite after 2.3. Thus, if you choose to stay with Lite, you will be using an outdated version of the toolkit and you won’t benefit from any further improvements made to the GATK with the 2.4 release and in the future.
We welcome any and all comments on these new changes, which are due to take effect with the upcoming release of version 2.4 (tentatively scheduled for early February). There’s still time to tweak the language of the license if you spot any issues we’ve overlooked.
If you are using the GATK in an academic or non-profit research setting and have any questions or concerns about the details of the new license (attached), please join the discussion in the comments below. If you are using the GATK in a for-profit context, please contact our partners at Appistry as they will be in a better position to address your questions. If you’re not sure in which category you belong, please contact either Appistry or Issi Rozen at the Broad Institute.
This is not a bug per se in that it does not cause incorrect output, but I think it would be accurately described as an "unintended consequence" of very poorly compressed VCF output files.
GATK allows for output VCF files to be written using Picard's `BlockCompressedOutputStream` when the output file is specified with the extension `.vcf.gz`, which I consider to be very good behavior. However, I noticed after doing some minor external manipulation that the files produced this way are "suboptimally" compressed. By suboptimal, I mean that sometimes the files are even larger than the uncompressed VCF files.
Since the problem occurs in GATK-Lite, I was able to look through the source code to see what is going on. From what I can tell, the issue is that `mWriter.flush()` is called at the end of `VCFWriter.add()` for each variant. Per the documentation for `BlockCompressedOutputStream.flush()`:
WARNING: flush() affects the output format, because it causes the current contents of uncompressedBuffer to be compressed and written, even if it isn't full.
As a result, instead of the default of blocks of about 64k, the bgzf-formatted `.vcf.gz` files produced by GATK have a separate block for each line. That reduces the amount of repetition for gzip to take advantage of. Not knowing what issues led to requiring a call to flush after every variant, I'm not sure how best to address this, but it may be necessary to wrap `BlockCompressedOutputStream` when used by `VCFWriter` to catch this flush in order to get effective compression.
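To make the size effect concrete, here is a minimal sketch in Python rather than GATK's Java. It does not use Picard or real BGZF framing: it just treats each block as an independent gzip member, which is enough to show why one block per line compresses badly. The record contents, the `make_vcf_lines` helper and the other function names are made up for illustration.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # roughly the default block size mentioned above


def make_vcf_lines(n):
    """Build n synthetic, VCF-like records (illustrative data only)."""
    lines = []
    for i in range(n):
        pos = 10_000 + 37 * i
        record = f"chr1\t{pos}\t.\tA\tG\t{50 + i % 7}\tPASS\tDP={20 + i % 13}\tGT\t0/1\n"
        lines.append(record.encode())
    return lines


def compress_one_member_per_line(lines):
    """Mimic a flush after every record: each line becomes its own gzip member."""
    return b"".join(gzip.compress(line) for line in lines)


def compress_in_blocks(lines, block_size=BLOCK_SIZE):
    """Mimic buffered output: split the joined records into block_size chunks,
    one gzip member per chunk."""
    data = b"".join(lines)
    return b"".join(
        gzip.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    )


lines = make_vcf_lines(5000)
raw_size = sum(len(line) for line in lines)
per_line_size = len(compress_one_member_per_line(lines))
blocked_size = len(compress_in_blocks(lines))

print(f"uncompressed:       {raw_size} bytes")
print(f"one block per line: {per_line_size} bytes")
print(f"64k blocks:         {blocked_size} bytes")
```

Each tiny member pays a fixed header and trailer cost and starts with an empty compression window, so on short records like these the per-line output can end up larger than the raw text, while the 64k-block output is much smaller. Both outputs still decompress to the same bytes.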
Of course, it is possible to simply write the file and then compress it in a separate step, but this leads to disk IO that should be preventable.
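The wrapping idea could avoid that extra step. Below is a rough sketch of it, again in Python with toy classes rather than the real Picard or GATK code. `ToyBlockStream`, `FlushSuppressingWriter` and `write_records` are hypothetical names; the toy stream only imitates the documented behavior that `flush()` emits whatever is currently buffered as a block.

```python
import gzip


class ToyBlockStream:
    """Toy stand-in for a block-compressed stream (not the Picard class)."""

    def __init__(self, block_size=64 * 1024):
        self.block_size = block_size
        self.buffer = bytearray()
        self.blocks = []  # each entry is one compressed block

    def write(self, data):
        self.buffer += data
        # Emit full blocks as soon as enough data has accumulated.
        while len(self.buffer) >= self.block_size:
            self._emit(self.block_size)
        return len(data)

    def flush(self):
        # Like the documented flush(): compress what is buffered, even if not full.
        if self.buffer:
            self._emit(len(self.buffer))

    def close(self):
        self.flush()

    def _emit(self, n):
        self.blocks.append(gzip.compress(bytes(self.buffer[:n])))
        del self.buffer[:n]


class FlushSuppressingWriter:
    """Hypothetical wrapper: forwards writes, ignores flush(), flushes on close()."""

    def __init__(self, inner):
        self._inner = inner

    def write(self, data):
        return self._inner.write(data)

    def flush(self):
        pass  # deliberately a no-op so blocks are allowed to fill up

    def close(self):
        self._inner.close()


def write_records(stream, records):
    """Mimic a writer that flushes after every record."""
    for record in records:
        stream.write(record)
        stream.flush()
    stream.close()


records = [f"chr1\t{10_000 + i}\t.\tA\tG\t50\tPASS\t.\n".encode() for i in range(3000)]

direct = ToyBlockStream()
write_records(direct, records)

inner = ToyBlockStream()
write_records(FlushSuppressingWriter(inner), records)

print(len(direct.blocks), "blocks when every flush reaches the stream")
print(len(inner.blocks), "blocks when flush is suppressed")
```

In Java the analogous approach would presumably be a `java.io.FilterOutputStream` subclass that overrides `flush()`. The trade-off is that data only reaches the file when a block fills or the stream is closed, so whatever motivated the per-variant flush (for example, consumers reading the file while it is still being written) would need to be handled some other way.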