The GATK 2.0 release includes both the addition of brand-new (and often still experimental) tools and updates to the existing stable tools.
New Tools
- Base Recalibrator (BQSR v2), an upgrade to CountCovariates/TableRecalibration that generates base substitution, insertion, and deletion error models.
- Reduce Reads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
- HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
- Plus powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.
Base Quality Score Recalibration
- IMPORTANT: the Count Covariates and Table Recalibration tools (which comprise BQSRv1) have been retired! Please see the BaseRecalibrator tool (BQSRv2) for running recalibration with GATK 2.0.
Unified Genotyper
- Handle exception generated when non-standard reference bases are present in the fasta.
- Bug fix for indels: when checking the limits of a read to clip, the tool did not account for reads that had already been clipped.
- Now emits the MLE AC and AF in the INFO field.
- Don't allow N's in insertions when discovering indels.
Phase By Transmission
- Multi-allelic sites are now correctly ignored.
- Reporting of mendelian violations is enhanced.
- Corrected TP overflow.
- Fixed bug that arose when no PLs were present.
- Added option to output the father's allele first in phased child haplotypes.
- Fixed a bug that caused the wrong phasing of child/father pairs.
Variant Eval
- Improvements to the validation report module: if eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.
- If present, the AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC).
- Fixed bugs in the VariantType and IndelSize stratifications.
Variant Annotator
- FisherStrand annotation no longer hard-codes in filters for bases/reads (previously used MAPQ > 20 && QUAL > 20).
- Miscellaneous bug fixes to experimental annotations.
- Added a Clipping Rank Sum Test to detect when variants are present on reads with differential clipping.
- Fixed the ReadPos Rank Sum Test annotation so that it no longer uses the un-hardclipped start as the alignment start.
- Fixed bug in the NBaseCount annotation module.
- The new TandemRepeatAnnotator is now a standard annotation while HRun has been retired.
- Added PED support for the Inbreeding Coefficient annotation.
- Don't compute QD if there is no QUAL.
Variant Quality Score Recalibration
- The VCF index is now created automatically for the recalFile.
Variant Filtration
- Now allows you to run with type-unsafe JEXL selects, which all default to false when matching.
Select Variants
- Added an option which allows the user to re-genotype through the exact AF calculation model (if PLs are present) in order to recalculate the QUAL and genotypes.
Combine Variants
- Added --mergeInfoWithMaxAC argument to keep info fields from the input with the highest AC value.
Somatic Indel Detector
- GT header line is now output.
Indel Realigner
- Automatically skips Ion reads just like it does with 454 reads.
Variants To Table
- Genotype-level fields can now be specified.
- Added the --moltenize argument to produce molten output of the data.
Depth Of Coverage
- Fixed a NullPointerException that could occur if the user requested an interval summary but never provided a -L argument.
Miscellaneous
- BCF2 support in tools that output VCFs (use the .bcf extension).
- The GATK Engine no longer automatically strips the suffix "Walker" after the end of tool names; as such, all tools whose name ended with "Walker" have been renamed without that suffix.
- Fixed bug when specifying a JEXL expression for a field that doesn't exist: we now treat the whole expression as false (whereas we were rethrowing the JEXL exception previously).
- There is now a global --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends).
- Removed all code associated with extended events.
- Algorithmically faster version of DiffEngine.
- Better down-sampling fixes edge case conditions that used to be handled poorly. Read Walkers can now use down-sampling.
- GQ is now emitted as an int, not a float.
- Fixed bug in the Beagle codec that was skipping the first line of the file when decoding.
- Fixed bug in the VCF writer in the case where there are no genotypes for a record but there are genotypes in the header.
- Miscellaneous fixes to the VCF headers being produced.
- Fixed up the BadCigar read filter.
- Removed the old deprecated genotyping framework revolving around the misordering of alleles.
- Extensive refactoring of the GATKReports.
- Picard jar updated to version 1.67.1197.
- Tribble jar updated to version 110.
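As a rough illustration of the global --interval_padding argument listed above, the following Python sketch widens a 1-based, inclusive interval by a fixed number of basepairs on both ends. The function name `pad_interval` is hypothetical, and clamping the result to the contig bounds is an assumption; these notes do not say how the GATK treats padding that runs past the end of a contig.

```python
def pad_interval(start, end, padding, contig_length=None):
    """Widen a 1-based, inclusive interval by `padding` bases on both ends.

    Assumption: the padded interval is clamped to [1, contig_length]
    when a contig length is supplied.
    """
    padded_start = max(1, start - padding)
    padded_end = end + padding
    if contig_length is not None:
        padded_end = min(contig_length, padded_end)
    return padded_start, padded_end


if __name__ == "__main__":
    # A 101 bp target padded by 50 bp on each side.
    print(pad_interval(100, 200, 50))
```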
Base Quality Score Recalibration
- Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
- Implemented support for SOLiD no call strategies other than throwing an exception.
- Fixed smoothing in the BQSR bins.
- Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.
Unified Genotyper
- Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
- UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
- Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
- In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
- Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.
Haplotype Caller
- Added LowQual filter to the output when appropriate.
- Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
- Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
- Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
- Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
- Fixed bug where non-standard bases from the reference would cause errors.
- Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.
Reduce Reads
- Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
- Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
- Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.
Variant Eval
- Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
- Fixed incorrect allele counting in IndelSummary evaluation.
Combine Variants
- Now outputs the first non-MISSING QUAL, instead of the maximum.
- Now supports multi-threaded running (with the -nt argument).
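The QUAL change noted above for Combine Variants can be pictured with a small Python sketch. It assumes that a missing QUAL is represented as `None` and that the inputs are visited in the order they were supplied; the function name `first_non_missing_qual` is hypothetical and is not part of the GATK.

```python
def first_non_missing_qual(quals):
    """Return the first QUAL that is not missing, or None if all are missing.

    Assumption: `quals` lists the QUAL of each input record in input order,
    with None standing in for a missing value.
    """
    for qual in quals:
        if qual is not None:
            return qual
    return None


if __name__ == "__main__":
    # The earlier behavior would have reported max(...) == 50.0 here.
    print(first_non_missing_qual([None, 30.0, 50.0]))
```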
Select Variants
- Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
- No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
- If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).
Miscellaneous
- Updated and improved the BadCigar read filter.
- GATK now generates a proper error when a gzipped FASTA is passed in.
- Various improvements throughout the BCF2-related code.
- Removed various parallelism bottlenecks in the GATK.
- Added support of X and = CIGAR operators to the GATK.
- Catch NumberFormatExceptions when parsing the VCF POS field.
- Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
- Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
- We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
- Added support for handling complex events in ValidateVariants.
- Picard jar remains at version 1.67.1197.
- Tribble jar remains at version 110.
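For readers unfamiliar with the X and = CIGAR operators mentioned above: in the SAM format, = marks a sequence match and X a sequence mismatch, and both consume read and reference bases just as M does. The Python sketch below, with the hypothetical helper `cigar_lengths`, shows how the two operators fit into a span calculation; it is an illustration only and not the GATK's own code.

```python
import re

# Operators that advance along the reference and along the read,
# following the SAM format's operator table.
REF_CONSUMING = set("MDN=X")
READ_CONSUMING = set("MIS=X")

CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (reference_span, read_length) implied by a CIGAR string."""
    ref_len = 0
    read_len = 0
    for length, op in CIGAR_ELEMENT.findall(cigar):
        n = int(length)
        if op in REF_CONSUMING:
            ref_len += n
        if op in READ_CONSUMING:
            read_len += n
    return ref_len, read_len


if __name__ == "__main__":
    print(cigar_lengths("3S5=2I1X2D4="))
```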
GATK 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Base Quality Score Recalibration
- Improved the algorithm around homopolymer runs to use a "delocalized context".
- Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
- Fixed bug where the tool failed for reads that begin with insertions.
- Fixed bug in the scatter-gather functionality.
- Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).
Unified Genotyper
- Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
- The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
- Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
- Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
- Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
- Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
- Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
- Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
- The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
- Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
- Generalized ploidy model now handles reference calls correctly.
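The FisherStrand change described above (compute the p-value with and without low-quality bases, then keep the larger, least significant one) can be sketched in Python as follows. This is a minimal illustration under assumptions: the 2x2 tables are taken as (ref forward, ref reverse, alt forward, alt reverse) counts, the filtering that produces the second table is left to the caller, and the function names are hypothetical. The two-sided Fisher exact test here is a plain textbook implementation, not the GATK's.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each table sharing the observed margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    cutoff = probs[a] * (1 + 1e-7)
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


def least_significant_p(table_all, table_filtered):
    """Return the larger of the unfiltered and filtered p-values."""
    return max(fisher_exact_two_sided(*table_all),
               fisher_exact_two_sided(*table_filtered))


if __name__ == "__main__":
    # Strongly biased unfiltered counts, balanced counts after filtering.
    print(least_significant_p((1, 9, 11, 3), (5, 5, 5, 5)))
```

Taking the maximum makes the annotation conservative: a site is reported as strand-biased only if the bias is significant in both views of the data.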
Haplotype Caller
- Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
- Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
- Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
- Now requires at least 10 samples to merge variants into complex events.
Variant Annotator
- Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.
Reduce Reads
- Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
- Fixed bugs where, in rare cases, N bases were chosen as consensus over legitimate A, C, G, or T bases.
- Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.
Variant Filtration
- Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.
Variant Eval
- AlleleCount stratification now supports records with ploidy other than 2.
Combine Variants
- Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
- Now outputs the first non-missing QUAL, not the maximum.
Select Variants
- Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
- Removed the -number argument because it gave biased results.
Validate Variants
- Added option to selectively choose particular strict validation options.
- Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
- Improved the error message around unused ALT alleles.
Somatic Indel Detector
- Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).
Miscellaneous
- New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
- Fixed raw HapMap file conversion bug in VariantsToVCF.
- Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
- Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
- Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
- Fixed bug in BCF2 writer for case where all genotypes are missing.
- Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
- Fixed bug in Phase By Transmission when there are no likelihoods present.
- Fixed bug in fasta .fai generation.
- Updated and improved version of the BadCigar read filter.
- Picard jar remains at version 1.67.1197.
- Tribble jar remains at version 110.
GATK 2.3 was released on December 17, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Base Quality Score Recalibration
- Soft clipped bases are no longer counted in the delocalized BQSR.
- The user can now set the maximum allowable cycle with the --maximum_cycle_value argument.
Unified Genotyper
- Minor (5%) run time improvements to the Unified Genotyper.
- Fixed bug for the indel model that occurred when long reads (e.g. Sanger) in a pileup led to a read starting after the haplotype.
- Fixed bug in the exact AF calculation where log10pNonRefByAllele should really be log10pRefByAllele.
Haplotype Caller
- Fixed the performance of GENOTYPE_GIVEN_ALLELES mode, which often produced incorrect output when passed complex events.
- Fixed the interaction with the allele biased downsampling (for contamination removal) so that the removed reads are not used for downstream annotations.
- Implemented minor (5-10%) run time improvements to the Haplotype Caller.
- Fixed the logic for determining active regions, which was a bit broken when intervals were used in the system.
Variant Annotator
- The FisherStrand annotation ignores reduced reads (because they are always on the forward strand).
- Can now be run multi-threaded with -nt argument.
Reduce Reads
- Fixed bug where sometimes the start position of a reduced read was less than 1.
- ReduceReads now co-reduces bams if they're passed in together with multiple -I.
Combine Variants
- Fixed the case where the PRIORITIZE option is used but no priority list is given.
Phase By Transmission
- Fixed bug where the AD wasn't being printed correctly in the MV output file.
Miscellaneous
- A brand new version of the per site down-sampling functionality has been implemented that works much, much better than the previous version.
- More efficient initial file seeking at the beginning of the GATK traversal.
- Fixed the compression of VCF.gz where the output was too big because of unnecessary call to flush().
- The allele biased downsampling (for contamination removal) has been rewritten to be smarter; also, it no longer aborts if there's a reduced read in the pileup.
- Added a major performance improvement to the GATK engine that stemmed from a problem with the NanoSchedule timing code.
- Added checking in the GATK for mis-encoded quality scores.
- Fixed downsampling in the ReadBackedPileup class.
- Fixed the parsing of genome locations that contain colons in the contig names (which is allowed by the spec).
- Made ID an allowable INFO field key in our VCF parsing.
- Multi-threaded VCF to BCF writing no longer produces an invalid intermediate file that fails on merging.
- Picard jar remains at version 1.67.1197.
- Tribble jar updated to version 119.
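The genome-location parsing fix listed above concerns contig names that themselves contain colons. One way to picture the problem is the Python sketch below, which splits on the last colon and only treats the tail as a position if it looks like one. This is an assumed approach for illustration (the notes do not describe the GATK's actual parser), and `parse_location` is a hypothetical name.

```python
import re

# A position is either a single coordinate or a start-end range.
_POSITION = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_location(text):
    """Parse 'contig', 'contig:pos' or 'contig:start-end'.

    Returns (contig, start, end); start and end are None for a bare contig.
    Assumption: the position, if present, follows the LAST colon.
    """
    contig, sep, tail = text.rpartition(":")
    match = _POSITION.match(tail) if sep else None
    if match is None:
        return text, None, None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return contig, start, end


if __name__ == "__main__":
    print(parse_location("HLA-A*01:01:01:01:500-600"))
```

A purely textual rule like this cannot distinguish a bare contig whose name ends in ":<digits>" from a contig plus position; a real parser would presumably resolve that by checking candidate names against the reference sequence dictionary.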
GATK 2.4 was released on February 26, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Important note 1 for this release: with this release comes an updated licensing structure for the GATK. Different files in our public repository are protected with different licenses, so please see the text at the top of any given file for details as to its particular license.
Important note 2 for this release: the GATK team spent a tremendous amount of time and engineering effort to add extensive tests for many of our core tools (a process that will continue into future releases). Unsurprisingly, as part of this process many small (and some not so small) bugs were uncovered during testing that we subsequently fixed. While we usually attempt to enumerate in our release notes all of the bugs fixed during a given release, that would entail quite a Herculean effort for release 2.4; so please just be aware that there were many smaller fixes that may be omitted from these notes.
Base Quality Score Recalibration
- The underlying calculation of the recalibration has been improved and generalized so that the empirical quality is now calculated through a Bayesian estimate. This radically improves the accuracy in particular for bins with small numbers of observations.
- Added many run time improvements so that this tool now runs much faster.
- Print Reads writes a header when used with the -BQSR argument.
- Added a check to make sure that BQSR is not being run on a reduced bam (which would be bad).
- The --maximum_cycle_value argument can now be specified during the Print Reads step to prevent problems when running on bams with extremely long reads.
- Fixed bug where reads with an existing BQ tag and soft-clipped bases could cause the tool to error out.
Unified Genotyper
- Fixed the QUAL calculation for monomorphic (homozygous reference) sites (the math for previous versions was not correct).
- Biased downsampling (i.e. contamination removal) values can now be specified as per-sample fractions.
- Fixed bug where biased downsampling (i.e. contamination removal) was not being performed correctly in the presence of reduced reads.
- The indel likelihoods calculation had several bugs (e.g. sometimes the log likelihoods were positive!) that manifested themselves in certain situations and these have all been fixed.
- Small run time improvements were added.
Haplotype Caller
- Extensive performance improvements were added to the Haplotype Caller. This includes run time enhancements (it is now much faster than previous versions) plus improvements in accuracy for both SNPs and indels. Internal assessment now shows the Haplotype Caller calling variants more accurately than the Unified Genotyper. The changes for this tool are so extensive that they cannot easily be enumerated in these notes.
Variant Annotator
- The QD annotation is now divided by the average length of the alternate allele (weighted by the allele count); this does not affect SNPs but makes the calculation for indels much more accurate.
- Fixed Fisher Strand annotation where p-values sometimes summed to slightly greater than 1.0.
- Fixed Fisher Strand annotation for indels where reduced reads were not being handled correctly.
- The Haplotype Score annotation no longer applies to indels.
- Added the Variant Type annotation (not enabled by default) to annotate the VCF record with the variant type.
- The DepthOfCoverage annotation has been renamed to Coverage.
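The QD change described above amounts to one extra division. The Python sketch below shows the arithmetic under stated assumptions: QD is taken as QUAL divided by depth, and the allele lengths are supplied by the caller because these notes do not specify how "length" is measured for a given allele. The name `length_normalized_qd` is hypothetical.

```python
def length_normalized_qd(qual, depth, alt_lengths, alt_counts):
    """QUAL / depth, divided by the allele-count-weighted mean alt length.

    alt_lengths[i] is the length assigned to alternate allele i and
    alt_counts[i] is its allele count (AC). Returns None when the value
    is undefined (no depth or no alternate allele observations).
    """
    total_ac = sum(alt_counts)
    if depth <= 0 or total_ac <= 0:
        return None
    mean_length = sum(l * c for l, c in zip(alt_lengths, alt_counts)) / total_ac
    return qual / depth / mean_length


if __name__ == "__main__":
    # SNP: the mean length is 1, so the value is unchanged (600 / 30).
    print(length_normalized_qd(600.0, 30, [1], [2]))
    # Two indel alleles of lengths 4 and 2 with AC 1 and 3: mean length 2.5.
    print(length_normalized_qd(600.0, 30, [4, 2], [1, 3]))
```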
Reduce Reads
- Several small run time improvements were added to make this tool slightly faster.
- By default this tool now uses a downsampling value of 40x per start position.
Indel Realigner
- Fixed bug where some reads with soft clipped bases were not being realigned.
Combine Variants
- Run time performance improvements added where one uses the PRIORITIZE or REQUIRE_UNIQUE options.
Select Variants
- The --regenotype functionality has been removed from SelectVariants and transferred into its own tool: RegenotypeVariants.
Variant Eval
- Removed the GenotypeConcordance evaluation module (which had many bugs) and converted it into its own tested, standalone tool (called GenotypeConcordance).
Miscellaneous
- The VariantContext and related classes have been moved out of the GATK codebase and into Picard's public repository. The GATK now uses the variant.jar as an external library.
- Added a new Read Filter to reassign just a particular mapping quality to another one (see the ReassignOneMappingQualityFilter).
- Added the Regenotype Variants tool that allows one to regenotype a VCF file (which must contain likelihoods in the PL field) after samples have been added/removed.
- Added the Genotype Concordance tool that calculates the concordance of one VCF file against another.
- Bug fix for VariantsToVCF for records where old dbSNP files had '-' as the reference base.
- The GATK now automatically converts IUPAC bases in the reference to Ns and errors out on other non-standard characters.
- Fixed bug for the DepthOfCoverage tool which was not counting deletions correctly.
- Added Cat Variants, a standalone tool to quickly combine multiple VCF files whose records are non-overlapping (e.g. as produced during scatter-gather).
- The Somatic Indel Detector has been removed from our codebase and moved to the Broad Cancer group's private repository.
- Fixed Validate Variants rsID checking which wasn't working if there were multiple IDs.
- Picard jar updated to version 1.84.1337.
- Tribble jar updated to version 1.84.1337.
- Variant jar updated to version 1.85.1357.
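The reference-handling rule noted above (IUPAC bases become Ns, other non-standard characters are an error) can be sketched in Python as follows. The helper name `normalize_reference` is hypothetical, and upper-casing the input before checking is an assumption made for this illustration; the notes do not say how case is treated at this step.

```python
STANDARD_BASES = set("ACGTN")
# IUPAC ambiguity codes other than N.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHV")


def normalize_reference(sequence):
    """Replace IUPAC ambiguity codes with N; reject anything else unusual.

    Assumption: input is upper-cased before the check.
    """
    normalized = []
    for index, base in enumerate(sequence.upper()):
        if base in STANDARD_BASES:
            normalized.append(base)
        elif base in IUPAC_AMBIGUITY_CODES:
            normalized.append("N")
        else:
            raise ValueError(
                f"non-standard reference character {base!r} at offset {index}")
    return "".join(normalized)


if __name__ == "__main__":
    print(normalize_reference("ACGRYTN"))
```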
GATK 2.5 was released on April 30, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Reduce Reads
- DRASTIC improvements in the compression algorithm plus myriad bug fixes. Too many to list here; see detailed version history for more information.
Unified Genotyper
- Fixed bug for indel calling with really long reads (assigning the wrong genotypes).
- Automatic contamination fixing now works on reduced reads.
- Fixed rare bug in the general ploidy SNP likelihood model when there are no informative reads in a pileup.
- Fixed bug where haplotypes with 0 bases were being created.
- Fixed problem where our internal PairHMM was generating positive likelihoods.
Haplotype Caller
- Comprehensive performance improvements to the accuracy of calling both SNPs and indels; runtime is also much improved (but still slower than the Unified Genotyper; we expect it to be faster than UG in the next release though). See detailed version history for more information.
- Fixed bug for calling on reduced reads (counts were not being assigned correctly).
- Fixed problem where our internal PairHMM was generating positive likelihoods.
- Can now write BAMs showing the assembled haplotypes.
Diagnose Targets
- Significantly refactored this tool; it now works with a "plugin" system (see documentation for more information).
- Fixed bug where LOW_MEDIAN_COVERAGE was output when no reads are covering the interval.
- Fixed bug where intervals were skipped when they were not covered by any reads.
Base Recalibrator
- Fixed the tool to work correctly with empty BQSR tables.
- Fixed issue where Print Reads was running out of disk space when using the -BQSR option even for small bam files.
- Fixed bug for RNA seq alignments with Ns.
Select Variants
- Fixed bug where using the --exclude_sample_file argument was giving bad results.
- Fixed bug when using the --keepOriginalAC argument which caused it to emit bad VCFs.
- Fixed bug where maxIndelSize argument wasn't getting applied to deletions.
Variant Annotator
- Added support for snpEff "GATK compatibility mode".
- Can now list available annotations by running:
    java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.ListAnnotations
- QualByDepth remaps QD values > 40 to a gaussian around 30.
- Removed several deprecated annotations (AverageAltAlleleLength, MappingQualityZeroFraction, and TechnologyComposition) and others are no longer marked as experimental.
Variant Filtration
- Don't allow users to specify keys and IDs that contain angle brackets or equals signs (which are not allowed in the VCF specification).
- Added feature that allows one to filter sites outside of a given mask.
Left Align Variants
- Renamed to LeftAlignAndTrimVariants.
- Added ability to trim common bases in front of indels before left-aligning.
- Added ability to split multiallelic records and then left align them.
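To make the trimming and splitting steps listed above concrete, the Python sketch below splits a multiallelic record into one REF/ALT pair per alternate allele and then removes shared leading bases, keeping at least one base in each allele. It is a simplified illustration with hypothetical function names: genotypes and INFO fields are ignored, and the left-alignment step itself is not shown.

```python
def trim_common_prefix(pos, ref, alt):
    """Drop leading bases shared by REF and ALT, keeping one base in each."""
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # each trimmed base shifts the record start right by one
    return pos, ref, alt


def split_and_trim(pos, ref, alts):
    """Return one trimmed (pos, ref, alt) record per alternate allele."""
    return [trim_common_prefix(pos, ref, alt) for alt in alts]


if __name__ == "__main__":
    # One deletion allele and one insertion allele at the same site.
    for record in split_and_trim(100, "GCACA", ["GCA", "GCACACA"]):
        print(record)
```

Splitting before trimming matters because the shared prefix can differ per allele: in the example, the deletion and the insertion end up anchored at different positions.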
Miscellaneous
- We removed the auto-creation of fai/dict files for fasta references because it was too buggy.
- Fixed bug where we could fail to find the intersection of unsorted/missorted interval lists.
- Fixed @PG tag uniqueness issue with BAMs we were producing.
- Fixed rare bug in GenotypeConcordance for multi-allelic sites.
- Added check for reads without stored bases (i.e. that use '*') which we do not support.
- Added support for reduced reads to CallableLoci.
- Added a new walker to split MNPs into their allelic primitives (SNPs).
- We no longer allow the use of compressed (.gz) references in the GATK.
- Picard/Tribble/Variant jars updated to version 1.90.1442.