GATK release 1.4
From GSA
Release date: December 31, 2011
Unified Genotyper
- Implemented the proper QUAL calculation for multi-allelic calls (used previously only in GENOTYPE_GIVEN_ALLELES mode).
- The ability to genotype multiple alleles in discovery mode has been added for SNPs. See the --multiallelic argument documentation for more details. Please note that this is still highly experimental.
- Added an argument to specify the maximum number of alternate alleles to genotype at any one site.
- When making reference calls, don't output per-sample PLs (since they don't make sense without an alternate allele).
- Genotypes are now assigned with a greedy approach to achieve the target MLE, instead of a dynamic programming implementation based on the posterior probabilities.
- No-call genotypes are no longer annotated with garbage values.
- Bug fix for indels: skip likelihood computation in the case where too many bases were cut and there is no haplotype or read left.
- Bug fix for indels: catch an out-of-bounds error that can show up in RNAseq reads with long N CIGAR operators in the middle.
Variant Annotator
- Added an annotation to compute the TDT (Transmission Disequilibrium Test).
- Added an option to do strict allele matching when annotating with comp track concordance.
- QualByDepth now correctly skips no-call genotypes.
- All Rank Sum Tests were updated to be deterministic.
- Transfers headers from the resource VCF (when possible) when using expressions.
- Temporarily disabled SnpEff support in the GATK until SnpEff 2.0.4 has been officially released.
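For readers unfamiliar with the TDT, the classical form of the statistic can be sketched as follows. This is the textbook McNemar-style calculation and is not necessarily what the new annotation computes; the class and method names are hypothetical, and returning 0.0 when there are no informative transmissions is an assumption.

```java
// Hypothetical helper illustrating the classical TDT statistic; not GATK code.
final class TdtSketch {
    /**
     * @param transmitted   times heterozygous parents transmitted the allele of interest
     * @param untransmitted times heterozygous parents transmitted the other allele
     * @return chi-square statistic (1 degree of freedom)
     */
    static double tdt(int transmitted, int untransmitted) {
        int total = transmitted + untransmitted;
        if (total == 0) {
            return 0.0; // assumption: report 0.0 rather than NaN (0/0)
        }
        double diff = transmitted - untransmitted;
        return (diff * diff) / total;
    }

    public static void main(String[] args) {
        // (30 - 10)^2 / (30 + 10)
        System.out.println(tdt(30, 10));
    }
}
```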
Variant Eval
- Better Mendelian Violation evaluation module (completely rewritten).
- Added an IntervalStratification module.
- Ti/Tv calculation no longer rounds values.
- Prints 0.0 Ti/Tv and not NaN when there are no variants.
- Performance optimizations throughout the tool.
- Tables are now cleanly formatted (floats are printed with %.2f).
- VariantSummary is a standard report now.
- Removed CompEvalGenotypes (it didn't do anything) and SimpleMetricsByAC modules.
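The Ti/Tv behavior described in the list above (no rounding, and 0.0 rather than NaN when there are no variants) can be sketched roughly as follows. This is an illustration only, not the VariantEval code; the class and method names are hypothetical.

```java
// Hypothetical sketch of an unrounded Ti/Tv ratio; not the VariantEval implementation.
final class TiTvSketch {
    /** A SNP is a transition if it swaps purines (A<->G) or pyrimidines (C<->T). */
    static boolean isTransition(char ref, char alt) {
        String pair = "" + Character.toUpperCase(ref) + Character.toUpperCase(alt);
        return pair.equals("AG") || pair.equals("GA")
            || pair.equals("CT") || pair.equals("TC");
    }

    /** Returns the unrounded ratio, or 0.0 (not NaN) when no variants were seen. */
    static double ratio(int nTransitions, int nTransversions) {
        if (nTransitions + nTransversions == 0) {
            return 0.0; // 0/0 in floating point would otherwise be NaN
        }
        // Note: with transitions but no transversions this is Infinity.
        return (double) nTransitions / nTransversions;
    }

    public static void main(String[] args) {
        System.out.println(ratio(0, 0));
        System.out.println(ratio(5, 2));
    }
}
```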
Select Variants
- Fixed a bug that occurred when multiple records were present at a given position: all variants following one that failed the filters were dropped, instead of just the failing one.
- Added support for selecting only variants with specific IDs from a file.
- Added support for running multi-threaded (i.e. with the -nt argument).
- Fixed incorrect behavior when using a discordance track but no sample specifications.
Combine Variants
- Implemented a better algorithm for merging genotypes.
Indel Realigner and Realigner Target Creator
- No longer changes the mapping quality of MQ=255 reads that get realigned.
- Fixed bug where the GATK holds onto too much memory when running the Realigner Target Creator with multiple threads.
Variant Quality Score Recalibrator
- Now detects if the negative model failed to converge properly because of having too few data points and automatically retries with more appropriate clustering parameters.
Variant Context
- Fixed bug for complex records: we can no longer clip out a complete allele.
- We now fail gracefully when encountering malformed VCFs without enough data columns.
Misc Tools and Features
- We now check for reads with missing read groups and throw a UserException when encountered.
- Rarely (if ever) used Read Pair traversal type is no longer supported.
- We no longer allow the '-L "interval1;interval2"' syntax; use '-L interval1 -L interval2' instead.
- Fixed a bug in the automatic clipping of adapter sequence in LocusWalkers.
- Fixed a performance issue with RODWalker parallelization.
- Performance improvements to the general GATK engine, especially when working with many input BAM files.
- Added a new tool called Phase By Transmission to perform phasing of variants based on family structure.
- Added a new tool called Validation Site Selector that helps the user choose a subset of sites to use for a validation of the overall callset.
- Fix for problem with too many open file handles: actually close file handles when close() is called.
- Fix for problem when using Bed files as intervals: one no longer needs to explicitly state the file type.
- Traversal statistics start printing after 30 seconds instead of 2 minutes.
- Revved Picard jar to version 1.57.1030.
- Revved Tribble jar to version 46.
This includes up to commit 472fc94f003d7043614639dfde29b38cb7a9d439.
For GATK developers: new VariantContext interface
GenotypesContext: an efficient representation for a collection of Genotype objects
- The old version was a Map<String, Genotype>. The new version is a GenotypesContext, which implements the List<Genotype> interface but also has efficient random access by sample name (getSample(String sampleName), for example).
- Static constructor interface (GenotypesContext.create() and GenotypesContext.copy()) is used to create:
// create an empty GC
GenotypesContext gc1 = GenotypesContext.create();
// create a GC containing a copy of the GenotypesContext from the VariantContext
GenotypesContext gc2 = GenotypesContext.copy(variantContext.getGenotypes());
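To illustrate how a collection can behave as a list while still supporting fast lookup by sample name, here is a minimal sketch. It is not the GATK implementation: the NamedList class, its getByName method, and the lazily built index are all assumptions made for illustration, and it uses newer Java syntax than GATK 1.4 itself.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; the real GenotypesContext differs in detail.
final class NamedList<T> extends AbstractList<T> {
    private final List<T> items = new ArrayList<>();
    private final Function<T, String> nameOf;
    private Map<String, Integer> nameToIndex; // built lazily, dropped on mutation

    NamedList(Function<T, String> nameOf) {
        this.nameOf = nameOf;
    }

    @Override public T get(int index) { return items.get(index); }

    @Override public int size() { return items.size(); }

    @Override public void add(int index, T item) {
        items.add(index, item);
        modCount++;         // keep iterators fail-fast
        nameToIndex = null; // invalidate the cached index
    }

    /** Looks an element up by name; returns null if no element has that name. */
    T getByName(String name) {
        if (nameToIndex == null) {
            nameToIndex = new HashMap<>();
            for (int i = 0; i < items.size(); i++) {
                nameToIndex.put(nameOf.apply(items.get(i)), i);
            }
        }
        Integer i = nameToIndex.get(name);
        return i == null ? null : items.get(i);
    }
}
```

In this sketch the name index is only built on the first lookup by name, so code that just iterates over the list never pays for the map.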
VariantContexts can now only be built with VariantContextBuilder
- The old approach had many overloaded constructors, with a lot of duplicated code creating empty data structures for missing arguments. What we really needed was a clean programming approach to creating a VariantContext. VariantContextBuilder follows the Java builder pattern to make creating VariantContexts easy:
// create an empty builder with all default values
VariantContextBuilder def = new VariantContextBuilder();
// create a VCB with the min. amount of information to make a VariantContext, and set the log10PError
VariantContextBuilder builder = new VariantContextBuilder("name", loc.getContig(), loc.getStart(), loc.getStop(), alleles).log10PError(0.0);
// create a new variant context builder based on an existing VariantContext
VariantContextBuilder fromExisting = new VariantContextBuilder(vc);
// create a new variant context based on the old one
VariantContext copy = new VariantContextBuilder(vc).make();
// create a new variant context based on the old one with new genotypes
VariantContextBuilder sub = new VariantContextBuilder(vc).genotypes(genotypes);
A non-exhaustive list of other important changes
- The annoying representation of probabilities as -1 * log10(p) is gone. All code uses log10(p) instead
- Extensive unit tests throughout the code now ensure correctness. Many edge-case bugs are fixed
- The underlying GenotypesContext and associated cleanups have sped up the code significantly. For example, working with HM3 and OMNI genotype VCFs, we see:
  - CombineVariants is 33% faster with the new code
  - SelectVariants for NA12878 is 33% faster
  - VariantEval with -ST Sample on the OMNI file is 33% faster
- isPolymorphic is now isPolymorphicInSamples (same for isMonomorphic)
- getChromosomeCount is now getCalledChrCount to reflect that it now treats genotype A/. as 1 called chromosome
- subContextFromGenotypes completely removed in favor of the cleaner subContextFromSamples
- ID field is a first class object in VariantContext -- no longer stored in the attributes
- MutableVariantContext and MutableGenotype removed as they never were really used.
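As a small illustration of the sign convention change noted in the list above, the sketch below converts between the old -1 * log10(p) style and the new log10(p) style, and shows how a Phred-scaled quality relates to a log10 error probability. The class and method names are hypothetical and are not part of the GATK API.

```java
// Hypothetical helpers showing the sign convention; not GATK code.
final class Log10Sketch {
    /** Old style stored -log10(p); new style stores log10(p), which is <= 0 for a probability. */
    static double fromNegLog10(double negLog10P) {
        return -negLog10P;
    }

    /** log10 of an error probability, e.g. 0.001 gives roughly -3. */
    static double log10PError(double pError) {
        return Math.log10(pError);
    }

    /** Phred-scaled quality from a log10 error probability: Q = -10 * log10(pError). */
    static double phred(double log10PError) {
        return -10.0 * log10PError;
    }

    public static void main(String[] args) {
        double log10P = fromNegLog10(3.0); // an old-style value of 3.0
        System.out.println(log10P);
        System.out.println(phred(log10P));
    }
}
```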