GATK release 1.4

Release date: December 31, 2011

Unified Genotyper

  • Implemented the proper QUAL calculation for multi-allelic calls (used previously only in GENOTYPE_GIVEN_ALLELES mode).
  • The ability to genotype multiple alleles in discovery mode has been added for SNPs. See the --multiallelic argument documentation for more details. Please note that this is still highly experimental.
  • Added an argument to specify the maximum number of alternate alleles to genotype at any one site.
  • When making reference calls, don't output per-sample PLs (since they don't make sense without an alternate allele).
  • Genotypes are now assigned with a greedy approach to achieve the target MLE, instead of a dynamic programming implementation based on the posterior probabilities.
  • No-call genotypes are no longer annotated with garbage values.
  • Bug fix for indels: skip likelihood computation in the case where too many bases were cut and there is no haplotype or read left.
  • Bug fix for indels: catch an out-of-bounds error that can show up in RNAseq reads with long N CIGAR operators in the middle.


Variant Annotator

  • Added an annotation to compute the TDT (Transmission Disequilibrium Test).
  • Added an option to do strict allele matching when annotating with comp track concordance.
  • QualByDepth now correctly skips no-call genotypes.
  • All Rank Sum Tests were updated to be deterministic.
  • Transfers headers from the resource VCF (when possible) when using expressions.
  • Temporarily disabled SnpEff support in the GATK until SnpEff 2.0.4 has been officially released.


Variant Eval

  • Better Mendelian Violation evaluation module (completely rewritten).
  • Added an IntervalStratification module.
  • Ti/Tv calculation no longer rounds values.
  • Prints 0.0 Ti/Tv and not NaN when there are no variants.
  • Performance optimizations throughout the tool.
  • Tables are now cleanly formatted (floats are printed with %.2f).
  • VariantSummary is a standard report now.
  • Removed CompEvalGenotypes (it didn't do anything) and SimpleMetricsByAC modules.
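
The Ti/Tv change above (report 0.0 rather than NaN when there are no variants) can be illustrated with a minimal sketch. The class and method names here are hypothetical and are not taken from VariantEval; the sketch assumes uppercase A/C/G/T bases and only guards the no-variant case, since the notes do not say how a zero transversion count with nonzero transitions is handled.

```java
// Toy illustration only; NOT the VariantEval implementation.
public class TiTvSketch {
    // A and G are purines; C and T are pyrimidines (uppercase bases assumed).
    private static boolean isPurine(char base) {
        return base == 'A' || base == 'G';
    }

    // A transition swaps a purine for a purine or a pyrimidine for a pyrimidine.
    public static boolean isTransition(char ref, char alt) {
        return ref != alt && isPurine(ref) == isPurine(alt);
    }

    // Unrounded Ti/Tv ratio; 0.0 (not NaN) when there are no variants at all.
    public static double ratio(int nTransitions, int nTransversions) {
        if (nTransitions == 0 && nTransversions == 0) {
            return 0.0;
        }
        // Not specified by the notes: with zero transversions this yields Infinity.
        return (double) nTransitions / nTransversions;
    }
}
```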

Select Variants

  • Fixed a bug that occurred when multiple records were present at a given position: all variants following one that failed filters were dropped (instead of just the failing one).
  • Added support for selecting only variants with specific IDs from a file.
  • Added support for running multi-threaded (i.e. with the -nt argument).
  • Fixed incorrect performance when using a discordance track but no sample specifications.


Combine Variants

  • Implemented a better algorithm for merging genotypes.


Indel Realigner and Realigner Target Creator

  • No longer changes the mapping quality of MQ=255 reads that get realigned.
  • Fixed bug where the GATK holds onto too much memory when running the Realigner Target Creator with multiple threads.


Variant Quality Score Recalibrator

  • Now detects if the negative model failed to converge properly because of having too few data points and automatically retries with more appropriate clustering parameters.


Variant Context

  • Fixed bug for complex records: we can no longer clip out a complete allele.
  • We now fail gracefully when encountering malformed VCFs without enough data columns.


Misc Tools and Features

  • We now check for reads with missing read groups and throw a UserException when encountered.
  • The rarely (if ever) used Read Pair traversal type is no longer supported.
  • We no longer allow the '-L "interval1;interval2"' syntax; use '-L interval1 -L interval2' instead.
  • Fixed a bug in the automatic clipping of adapter sequence in LocusWalkers.
  • Fixed a performance issue with RODWalker parallelization.
  • Performance improvements to the general GATK engine, especially when working with many input BAM files.
  • Added a new tool called Phase By Transmission to perform phasing of variants based on family structure.
  • Added a new tool called Validation Site Selector that helps the user choose a subset of sites to use for a validation of the overall callset.
  • Fix for problem with too many open file handles: actually close file handles when close() is called.
  • Fix for problem when using Bed files as intervals: one no longer needs to explicitly state the file type.
  • Traversal statistics start printing after 30 seconds instead of 2 minutes.
  • Revved Picard jar to version 1.57.1030.
  • Revved Tribble jar to version 46.

This includes up to commit 472fc94f003d7043614639dfde29b38cb7a9d439.


For GATK developers: new VariantContext interface

GenotypesContext: an efficient representation for a collection of Genotype objects

  • Old version was a Map<String, Genotype>. New version is a GenotypesContext. It implements the List<Genotype> interface, but also has efficient random access (getSample(String sampleName), for example).
  • A static constructor interface (GenotypesContext.create() and GenotypesContext.copy()) is used to create instances:
// create an empty GC
GenotypesContext gc1 = GenotypesContext.create();

// create a GC containing a copy of the GenotypesContext from the VariantContext
GenotypesContext gc2 = GenotypesContext.copy(variantContext.getGenotypes());
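
To illustrate the idea of a list that also supports lookup by sample name, here is a minimal self-contained sketch. It is a toy and not the GATK GenotypesContext: the class name SampleIndexedList, the nested Call type, and the get(String) method are all hypothetical, and the sketch assumes sample names are unique and that entries are only appended.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; NOT the GATK GenotypesContext implementation.
public class SampleIndexedList extends AbstractList<SampleIndexedList.Call> {
    // Hypothetical stand-in for a Genotype: a sample name plus a genotype string.
    public static final class Call {
        public final String sample;
        public final String genotype;

        public Call(String sample, String genotype) {
            this.sample = sample;
            this.genotype = genotype;
        }
    }

    // Ordered storage gives the List view; the map gives lookup by name.
    private final List<Call> calls = new ArrayList<>();
    private final Map<String, Integer> indexBySample = new HashMap<>();

    @Override
    public boolean add(Call call) {
        if (indexBySample.containsKey(call.sample)) {
            throw new IllegalArgumentException("Duplicate sample: " + call.sample);
        }
        indexBySample.put(call.sample, calls.size());
        calls.add(call);
        return true;
    }

    @Override
    public Call get(int index) {
        return calls.get(index);
    }

    @Override
    public int size() {
        return calls.size();
    }

    // Lookup by sample name without scanning the list; null if absent.
    public Call get(String sampleName) {
        Integer index = indexBySample.get(sampleName);
        return index == null ? null : calls.get(index);
    }
}
```

Because the toy extends AbstractList, it can be iterated like any other List while still answering by-name queries through the map.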

VariantContexts can now only be built with VariantContextBuilder

  • The old approach had many overloaded constructors and a lot of duplicated code for creating empty data structures for missing arguments. VariantContextBuilder provides a cleaner programming approach: it follows the Java builder pattern to make creating VariantContexts easy:
// create an empty builder with all default values
VariantContextBuilder def = new VariantContextBuilder();

// create a VCB with the min. amount of information to make a VariantContext, and set the log10PError
VariantContextBuilder builder = new VariantContextBuilder("name", loc.getContig(), loc.getStart(), loc.getStop(), alleles).log10PError(0.0);

// create a new variant context builder based on an existing VariantContext
VariantContextBuilder fromExisting = new VariantContextBuilder(vc);

// create a new variant context based on the old one
VariantContext copy = new VariantContextBuilder(vc).make();

// create a new variant context based on the old one with new genotypes
VariantContext sub = new VariantContextBuilder(vc).genotypes(genotypes).make();
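
For readers unfamiliar with the builder pattern, the following self-contained sketch shows the general shape: an immutable value object, a builder with chained setters, and a copy constructor that seeds a builder from an existing object. It is a toy and not the GATK VariantContextBuilder; the names SiteBuilderDemo, Site, and Builder, and the default value for log10PError, are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration of the builder pattern; NOT the GATK VariantContextBuilder.
public class SiteBuilderDemo {
    // Immutable value object, loosely analogous to a VariantContext.
    public static final class Site {
        public final String contig;
        public final int start;
        public final int stop;
        public final List<String> alleles;
        public final double log10PError;

        private Site(Builder b) {
            this.contig = b.contig;
            this.start = b.start;
            this.stop = b.stop;
            // Defensive copy so later builder changes cannot alter this Site.
            this.alleles = Collections.unmodifiableList(new ArrayList<>(b.alleles));
            this.log10PError = b.log10PError;
        }
    }

    public static final class Builder {
        private String contig;
        private int start;
        private int stop;
        private List<String> alleles;
        private double log10PError = 0.0; // hypothetical default

        // Minimum information needed to build a Site.
        public Builder(String contig, int start, int stop, List<String> alleles) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
            this.alleles = new ArrayList<>(alleles);
        }

        // Seed a builder from an existing Site, then override selected fields.
        public Builder(Site site) {
            this(site.contig, site.start, site.stop, site.alleles);
            this.log10PError = site.log10PError;
        }

        public Builder alleles(List<String> newAlleles) {
            this.alleles = new ArrayList<>(newAlleles);
            return this;
        }

        public Builder log10PError(double value) {
            this.log10PError = value;
            return this;
        }

        public Site make() {
            return new Site(this);
        }
    }
}
```

The design point is that one constructor plus chained setters replaces a family of overloaded constructors, and unspecified fields fall back to defaults held in a single place.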

A non-exhaustive list of other important changes

  • The annoying representation of probabilities as -1 * log10(p) is gone. All code uses log10(p) instead
  • Extensive unit tests now ensure correctness throughout, and many edge-case bugs have been fixed.
  • The underlying GenotypesContext and associated cleanups have sped up the code significantly. For example, working with HM3 and OMNI genotypes VCFs, we see:
    • CombineVariants is 33% faster with the new code
    • SelectVariants for NA12878 is 33% faster
    • VariantEval with -ST Sample on the OMNI file is 33% faster
  • isPolymorphic is now isPolymorphicInSamples (same for isMonomorphic)
  • getChromosomeCount is now getCalledChrCount to reflect that it now treats genotype A/. as 1 called chromosome
  • subContextFromGenotypes completely removed in favor of cleaner subContextFromSamples
  • ID field is a first class object in VariantContext -- no longer stored in the attributes
  • MutableVariantContext and MutableGenotype removed as they never were really used.
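
As a small illustration of the probability convention change listed above, the sketch below converts between the old -1 * log10(p) form, the new log10(p) form, and the Phred scale (Q = -10 * log10(p)). The class and method names are hypothetical helpers written for this example and are not part of the GATK API.

```java
// Toy helpers for the log10 convention; NOT part of the GATK API.
public class Log10Convention {
    // Old convention stored -log10(p); the new one stores log10(p), which is <= 0.
    public static double fromNegLog10(double negLog10P) {
        return -negLog10P;
    }

    // Phred-scaled quality from a log10 error probability: Q = -10 * log10(p).
    public static double toPhred(double log10P) {
        return -10.0 * log10P;
    }

    // Recover the plain probability from its log10 value.
    public static double toProbability(double log10P) {
        return Math.pow(10.0, log10P);
    }
}
```

For example, an error probability of 0.001 is 3.0 under the old convention, -3.0 under the new one, and Q30 on the Phred scale.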