Tagged with #release-notes 0 documentation articles | 14 announcements | 0 forum discussions

Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.

New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference Genotype Quality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic hom-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).
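
The RGQ change listed above can be pictured with a small sketch. This is not GATK code: the dictionary layout and the function name `move_gq_to_rgq` are hypothetical, and the only behavior taken from the notes is that GQ is carried over to RGQ at monomorphic sites instead of being dropped.

```python
def move_gq_to_rgq(genotype, site_is_monomorphic):
    """Return a copy of a per-sample FORMAT dict with GQ carried over as RGQ.

    `genotype` is a hypothetical dict of FORMAT fields for one sample.
    """
    updated = dict(genotype)
    if site_is_monomorphic and "GQ" in updated:
        # Keep the confidence value, but under the RGQ key.
        updated["RGQ"] = updated.pop("GQ")
    return updated


sample = {"GT": "0/0", "DP": 31, "GQ": 84}
print(move_gq_to_rgq(sample, site_is_monomorphic=True))
```

At a variant site the function returns the fields unchanged, so GQ keeps its usual meaning there.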

CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.
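
To illustrate what breaking blocks at multiples of N means for a single reference block, here is a minimal sketch. It assumes 1-based inclusive coordinates and that a new block begins at every position that is a multiple of N; both are assumptions for illustration, and `break_block` is a hypothetical helper, not part of CombineGVCFs.

```python
def break_block(start, end, n):
    """Split the inclusive block [start, end] so a new piece begins at each multiple of n."""
    if n <= 0:
        # Treat a non-positive n as "do not break"; the block is returned whole.
        return [(start, end)]
    pieces = []
    piece_start = start
    # First multiple of n that is strictly greater than start.
    boundary = (start // n + 1) * n
    while boundary <= end:
        pieces.append((piece_start, boundary - 1))
        piece_start = boundary
        boundary += n
    pieces.append((piece_start, end))
    return pieces


print(break_block(950, 2100, 1000))
# [(950, 999), (1000, 1999), (2000, 2100)]
```

With blocks cut this way, scatter intervals that start on a multiple of N never begin in the middle of a block.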

VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

CalculateGenotypePosteriors

• Fixed a user-reported bug affecting a "trio" family made up of two children and one parent.
• Added error handling for genotypes that are called but have no PLs.

Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately.
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.
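
The unique header key scheme described above (GATKCommandLine.(walker name), then GATKCommandLine.(walker name).# for repeats) can be sketched as follows. The function name `command_line_keys` is hypothetical and this is only an illustration of the naming rule, not the engine's code.

```python
from collections import Counter


def command_line_keys(walker_names):
    """Build one header key per walker run, numbering repeats from 2."""
    seen = Counter()
    keys = []
    for name in walker_names:
        seen[name] += 1
        key = f"GATKCommandLine.{name}"
        if seen[name] > 1:
            # Second and later occurrences get the occurrence number appended.
            key += f".{seen[name]}"
        keys.append(key)
    return keys


print(command_line_keys(["SomeWalker", "OtherWalker", "SomeWalker"]))
# ['GATKCommandLine.SomeWalker', 'GATKCommandLine.OtherWalker', 'GATKCommandLine.SomeWalker.2']
```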

Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5, 20, 60 to 1 through 60 in steps of 1, 60 through 90 in steps of 10, and 99, in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.
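
For readers who want the new default GQ bands as an explicit list, the sketch below expands the ranges given above. The reading of the ranges (1 to 60 by 1, then 70, 80, 90, then 99) is an interpretation of the note, and `default_gq_bands` is a hypothetical name.

```python
def default_gq_bands():
    """Expand the GQ band boundaries described in the 3.3 notes into a list."""
    bands = list(range(1, 61))        # 1..60 in steps of 1
    bands += list(range(70, 91, 10))  # 70, 80, 90 (60 is already included)
    bands.append(99)
    return bands


bands = default_gq_bands()
print(len(bands), bands[-5:])
# 64 [60, 70, 80, 90, 99]
```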

RNAseq analysis

• Bug fixes for working with unmapped reads.

CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site; now we omit FS.
• Extensive rewrite of the annotation documentation.

Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton).

Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620.

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

Haplotype Caller

• Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
• Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
• The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
• Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
• Fixed small bug where indel discovery was inconsistent based on the active-region size.
• Removed scary warning messages for "VectorPairHMM".
• Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
• When we subset PLs because alleles are removed during genotyping we now also subset the AD.
• Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

• The -mode argument is now required.
• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

AnalyzeCovariates

• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

• SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

• Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
• Fixed recurring error about missing PLs.
• No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
• --includeNonVariantSites should now be working.

Select Variants

• The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

• Now throws a user error when it encounters reads with I operators greater than the number of read bases.
• Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

CalculateGenotypePosteriors

• Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
• Now computes genotype posteriors using likelihoods from all members of the trio.
• Added annotations for calling potential de novo mutations.
• Now uses PP tag instead of GP tag because posteriors are Phred-scaled.
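
As a reminder of what Phred-scaled posteriors look like, here is a minimal sketch of the conversion. It uses the standard Phred formula, -10 * log10(p); shifting the values so the most likely genotype scores 0 (as PLs do) and rounding to integers are assumptions made for illustration, and `phred_scaled_posteriors` is a hypothetical name.

```python
import math


def phred_scaled_posteriors(posteriors):
    """Convert genotype posterior probabilities to rounded Phred-scaled integers."""
    floor = 1e-300  # avoids log10(0); an arbitrary choice for this sketch
    raw = [-10.0 * math.log10(max(p, floor)) for p in posteriors]
    best = min(raw)  # the most likely genotype has the smallest Phred value
    # Assumption: shift so the most likely genotype scores 0, as PLs do.
    return [round(value - best) for value in raw]


print(phred_scaled_posteriors([0.9, 0.09, 0.01]))
# [0, 10, 20]
```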

Cat Variants

• Can now process .list files with -V.
• Can now handle BCF and Block-Compressed VCF files.

Validate Variants

• Now works with gVCF files.
• By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

• Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

Miscellaneous

• Refactored maven directories and java packages replacing "sting" with "gatk".
• Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
• Added a new read transformer that refactors NDN cigar elements to one N element.
• Now a Tabix index is created for block-compressed output formats.
• Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
• Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
• We now check that output files are specified in a writeable location.
• We now allow blank lines in a (non-BAM) list file.
• Added legibility improvements to the Progress Meter.
• Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
• Made IntervalSharder respect the IntervalMergingRule specified on the command line.
• Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

Created 2014-03-17 16:52:43 | Updated 2014-03-19 15:13:51 | Tags: variantrecalibrator haplotypecaller randomlysplitvariants release-notes fastaalternatereferencemaker

GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Haplotype Caller

• Added new capabilities to the Haplotype Caller to use hardware-based optimizations. Can be enabled with --pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
• Fixed bugs in computing the weights of edges in the assembly graph. This was causing bad genotypes to be output when running the Haplotype Caller over multiple samples simultaneously (as opposed to creating gVCFs in the new recommended pipeline, which was working as expected).

Variant Recalibrator

• Fixed issue where output could be non-deterministic with very large data sets.

CalculateGenotypePosteriors

• Fixed several bugs where bad inputs were causing the tool to crash instead of gracefully exiting with an error message.

Miscellaneous

• RandomlySplitVariants can now output splits comprised of more than 2 output files.
• FastaAlternateReferenceMaker can now output heterozygous sites using IUPAC ambiguity encoding.
• Picard, Tribble, and Variant jars updated to version 1.109.1722.
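
The IUPAC ambiguity encoding mentioned above for heterozygous sites maps each pair of distinct bases to a single letter. The sketch below covers only two-allele, single-base sites; the function name `iupac_code` is hypothetical and the code is an illustration of the encoding, not FastaAlternateReferenceMaker's implementation.

```python
# Standard IUPAC ambiguity codes for pairs of distinct bases.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_code(allele1, allele2):
    """Return the base for a homozygous call or the IUPAC code for a het SNP."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    # Raises KeyError for anything other than two distinct single bases.
    return IUPAC_HET[frozenset((a, b))]


print(iupac_code("A", "G"), iupac_code("T", "C"), iupac_code("C", "C"))
# R Y C
```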

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.

SplitNCigarReads

• This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

• Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
• Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
• Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
• Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
• Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

CombineGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at GenotypeGVCFs all at once.

GenotypeGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

SimulateReadsForVariants

• This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

• Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

• Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

• Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
• Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
• In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

• Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
• Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

CalculateGenotypePosteriors

• Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
• It no longer strips out the AD annotation from genotypes.
• AC/AF/AN counts are updated after fixing genotypes.
• Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

• Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

• Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

Miscellaneous

• Thanks to Phillip Dexheimer for several Queue related fixes and patches.
• Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
• Providing an empty BAM list now results in a user error.
• Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
• Logger output now goes to STDERR instead of STDOUT.
• Picard, Tribble, and Variant jars updated to version 1.107.1683.

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

Unified Genotyper

• Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

• Improved the indexing scheme for gVCF outputs using the reference calculation model.
• The reference calculation model now works with reduced reads.
• Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
• Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

• Disable tranche plots in INDEL mode.
• Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data, the training set that gets used to build the model is randomly downsampled; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; the training set quality score threshold has been removed; 2 Gaussians are now used by default for the negative model; and the numBad argument has been removed, with the cutoffs now chosen by the model itself by looking at the LOD scores.

• Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.
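
The byte-versus-int bug above follows from Java's `byte` being a signed 8-bit type with range -128 to 127. The sketch below reproduces the effect in Python; it illustrates the class of bug only, and `as_signed_byte` is a hypothetical helper, not GATK code.

```python
def as_signed_byte(value):
    """Reinterpret the low 8 bits of an integer as a signed byte (-128..127)."""
    return int.from_bytes(bytes([value & 0xFF]), byteorder="big", signed=True)


for mq in (60, 127, 128, 200, 255):
    print(mq, "->", as_signed_byte(mq))
# 60 -> 60
# 127 -> 127
# 128 -> -128
# 200 -> -56
# 255 -> -1
```

Any mapping quality above 127 wraps around to a negative number, which is why storing the value as an int fixes the problem.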

Diagnose Targets

• Added calculation for GC content.
• Added an option to filter the bases based on their quality scores.

Combine Variants

• Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

• Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

• SplitSamFile now produces an index with the BAM.
• Length metric updates to QualifyMissingIntervals.
• Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
• Picard jar updated to version 1.104.1628.
• Tribble jar updated to version 1.104.1628.
• Variant jar updated to version 1.104.1628.

Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15 | Tags: indelrealigner unifiedgenotyper variantrecalibrator official haplotypecaller reducereads release-notes

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Reduce Reads

• Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
• Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
• Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
• Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

Unified Genotyper

• Added --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For proper handling of the reference model, please use the Haplotype Caller.

Haplotype Caller

• Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
• Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
• Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
• Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
• Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
• Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

Variant Annotator

• Fixed bug where only the last -comp was being annotated at a site.

Indel Realigner

• Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

Phase By Transmission

• Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

Variant Recalibrator

• General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
• Added mode to not emit (at all) variant records that are filtered out.
• This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
• Fixed bug where the tool occasionally produced bad log10 values internally.

Miscellaneous

• General performance improvements to the VCF reading code contributed by Michael McCowan.
• Error messages are much less verbose and "scary."
• Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
• Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
• Fixed bug in how the GATK counts filtered reads in the traversal output.
• Added a new tool called Qualify Intervals.
• Fixed major bug in the BCF encoding (the previous version produced problematic files that failed when read back into the GATK).
• Picard/sam/tribble/variant jars updated to version 1.96.1534.
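
The mapping-quality fix in the list above comes down to integer width. As a minimal illustration (not the GATK code itself, which is written in Java), the sketch below shows how a value held in one signed byte wraps negative above 127, whereas SAM mapping qualities range from 0 to 255. The helper name `as_signed_byte` is hypothetical.

```python
def as_signed_byte(value: int) -> int:
    """Reinterpret the low 8 bits of value as a signed (two's complement) byte."""
    return int.from_bytes(bytes([value & 0xFF]), byteorder="big", signed=True)


# Values up to 127 survive; anything larger comes back negative.
for mapq in (60, 127, 128, 255):
    print(mapq, "->", as_signed_byte(mapq))
```

Storing the value in a wider integer type avoids the wrap-around entirely.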

Created 2013-06-17 14:41:43 | Updated 2013-06-20 03:43:19 | Tags: official release-notes

GATK 2.6 was released on June 20, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Important note: with this release the GATK has officially moved to using Java 7.

Reduce Reads

• Small runtime performance improvements contributed by Michael McCowan.
• Added fix for the "Removed too many insertions, header is now negative" bug.
• Fixed bug that arises in multi-sample mode and causes the tool to crash.
• Added --cancer_mode argument to force the user to explicitly enable multi-sample mode.

Unified Genotyper

• Runtime performance improvements when calling indels; calling indels in a single sample is almost 2x faster in our tests.
• Fixed bug for GENOTYPE_GIVEN_ALLELES mode where it silently fails to genotype indels in some cases.

Haplotype Caller

• We have been working hard to reduce the number of false negatives (i.e. missed sites) for the Haplotype Caller and as such added a bunch of improvements to this tool. The sensitivity is now better than that of the Unified Genotyper in all of our whole genome tests for both SNPs and indels. Feel free to peruse the detailed version history for more information.
• The Haplotype Caller now annotates IDs from dbSNP properly.
• The Haplotype Caller now emits per-sample DP.
• Fixed bug with error: "Only one of refStart or refStop must be < 0, not both" that arose from soft-clipped reads at the beginning of contigs.
• Implemented a much improved version of GENOTYPE_GIVEN_ALLELES mode in the Haplotype Caller.

Indel Realigner

• Fixed bug where secondary alignments were not being handled correctly.

Genotype Concordance

• Added an overall genotype concordance metric to the output.
• Fixed a bug in the printout of molten data in how it treated the genotypes.

Diagnose Targets

• Diagnose Targets now has an option to output missing intervals.
• Fixed bug where sometimes intervals were emitted out of order.

Base Recalibrator

• Fixed bug for reads with indel CIGAR operators (I or D) at the start/end of the read.
• Introduced a new tool, AnalyzeCovariates, to generate the BQSR quality assessment plots as a separate step, instead of doing it through the BaseRecalibrator.

Combine Variants

• We no longer add PASS to the FILTER field of unfiltered records.

Variant Annotator

• The RMSMappingQuality annotation now works properly with reduced reads.
• The various rank sum tests no longer use reduced reads in their calculations (because those reads do not represent distinct observations).
• Fixed bug in the BaseQualityRankSumTest annotation where it was not actually using the base qualities.
• Added a new annotation DepthPerSampleHC that is used by default in the HaplotypeCaller.

Miscellaneous

• James Warren contributed a patch so that reference file names containing ".fa" somewhere other than the suffix parse correctly.
• We now emit the GATK version number in the header of VCFs that we produce.
• Fixed bug in the up front downsampling used by the GATK: reduced reads are no longer allowed to be eliminated during downsampling.
• dbSNP rsID matching is now smarter: variants are considered matching if they have the same reference allele and at least 1 common alternative allele.
• We now warn users about using the GATK with RNA-seq data.
• We now check that -compress arguments are within allowable range 0-9.
• -rf ReassignMappingQuality can now be used to reassign mapping qualities to 60 before the engine filters them out with MappingQualityUnassigned.
• Fixed bug where requesting gzip VCF output with multi-threading was causing the GATK to fail.
• We now require a minimum -dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage.
• Zero-length and repeated cigar elements are collapsed down by default in the engine.
• -ds option removed from PrintReads because it was redundant with the engine-level -dfrac argument.
• Fixed bug where the --defaultBaseQualities argument didn't always work.
• The engine now produces much more accurate read counts for Read traversals.
• Count Reads now uses a Long instead of an Integer for counts to prevent overflows.
• Locus Walkers now only try to clip adaptors when the two reads of a pair are on opposite strands.
• Fixed VCF issue where PLs were capped at 32767.
• Picard/Tribble/Variant jars updated to version 1.91.1453.
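
The dbSNP rsID matching rule described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GATK implementation: it assumes the two records are already known to sit at the same position, and the function name `alleles_match` is hypothetical.

```python
def alleles_match(ref_a, alts_a, ref_b, alts_b):
    """Return True if two co-located records count as the same variant.

    Rule from the release note: same reference allele and at least
    one alternative allele in common.
    """
    return ref_a == ref_b and bool(set(alts_a) & set(alts_b))


# A call with ALT alleles G and T still matches a dbSNP record listing only T.
print(alleles_match("A", ["G", "T"], "A", ["T"]))
# Same reference allele but no shared ALT allele: no match.
print(alleles_match("A", ["G"], "A", ["T"]))
```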

Created 2013-04-30 20:18:26 | Updated 2013-05-06 15:51:39 | Tags: official release-notes

GATK 2.5 was released on April 30, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Reduce Reads

• DRASTIC improvements in the compression algorithm plus myriad bug fixes. Too many to list here; see detailed version history for more information.

Unified Genotyper

• Fixed bug for indel calling with really long reads (assigning the wrong genotypes).
• Automatic contamination fixing now works on reduced reads.
• Fixed rare bug in the general ploidy SNP likelihood model when there are no informative reads in a pileup.
• Fixed bug where haplotypes with 0 bases were being created.
• Fixed problem where our internal PairHMM was generating positive likelihoods.

Haplotype Caller

• Comprehensive performance improvements to the accuracy of calling both SNPs and indels; runtime is also much improved (but still slower than the Unified Genotyper; we expect it to be faster than UG in the next release though). See detailed version history for more information.
• Fixed bug for calling on reduced reads (counts were not being assigned correctly).
• Fixed problem where our internal PairHMM was generating positive likelihoods.
• Can now write BAMs showing the assembled haplotypes.

Diagnose Targets

• Significantly refactored this tool; it now works with a "plugin" system (see documentation for more information).
• Fixed bug where LOW_MEDIAN_COVERAGE was output when no reads are covering the interval.
• Fixed bug where intervals were skipped when they were not covered by any reads.

Base Recalibrator

• Fixed the tool to work correctly with empty BQSR tables.
• Fixed issue where Print Reads was running out of disk space when using the -BQSR option even for small bam files.
• Fixed bug for RNA seq alignments with Ns.

Select Variants

• Fixed bug where using the --exclude_sample_file argument was giving bad results.
• Fixed bug when using the --keepOriginalAC argument which caused it to emit bad VCFs.
• Fixed bug where maxIndelSize argument wasn't getting applied to deletions.

Variant Annotator

• Added support for snpEff "GATK compatibility mode".
• Can now list available annotations by doing java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.ListAnnotations
• QualByDepth remaps QD values > 40 to a gaussian around 30.
• Removed several deprecated annotations (AverageAltAlleleLength, MappingQualityZeroFraction, and TechnologyComposition) and others are no longer marked as experimental.

Variant Filtration

• Don't allow users to specify keys and IDs that contain angle brackets or equals signs (which are not allowed in the VCF specification).
• Added feature that allows one to filter sites outside of a given mask.

Left Align Variants

• Renamed to LeftAlignAndTrimVariants.
• Added ability to trim common bases in front of indels before left-aligning.
• Added ability to split multiallelic records and then left align them.

Miscellaneous

• We removed the auto-creation of fai/dict files for fasta references because it was too buggy.
• Fixed bug where we could fail to find the intersection of unsorted/missorted interval lists.
• Fixed @PG tag uniqueness issue with BAMs we were producing.
• Fixed rare bug in GenotypeConcordance for multi-allelic sites.
• Added check for reads without stored bases (i.e. that use '*') which we do not support.
• Added a new walker to split MNPs into their allelic primitives (SNPs).
• We no longer allow the use of compressed (.gz) references in the GATK.
• Picard/Tribble/Variant jars updated to version 1.90.1442.

Created 2013-02-25 16:03:09 | Updated 2013-03-11 13:33:57 | Tags: official release-notes

GATK 2.4 was released on February 26, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Important note 1 for this release: with this release comes an updated licensing structure for the GATK. Different files in our public repository are protected with different licenses, so please see the text at the top of any given file for details as to its particular license.

Important note 2 for this release: the GATK team spent a tremendous amount of time and engineering effort to add extensive tests for many of our core tools (a process that will continue into future releases). Unsurprisingly, as part of this process many small (and some not so small) bugs were uncovered during testing that we subsequently fixed. While we usually attempt to enumerate in our release notes all of the bugs fixed during a given release, that would entail quite a Herculean effort for release 2.4; so please just be aware that there were many smaller fixes that may be omitted from these notes.

Base Quality Score Recalibration

• The underlying calculation of the recalibration has been improved and generalized so that the empirical quality is now calculated through a Bayesian estimate. This radically improves the accuracy in particular for bins with small numbers of observations.
• Added many run time improvements so that this tool now runs much faster.
• Print Reads writes a header when used with the -BQSR argument.
• Added a check to make sure that BQSR is not being run on a reduced bam (which would be bad).
• The --maximum_cycle_value argument can now be specified during the Print Reads step to prevent problems when running on bams with extremely long reads.
• Fixed bug where reads with an existing BQ tag and soft-clipped bases could cause the tool to error out.

Unified Genotyper

• Fixed the QUAL calculation for monomorphic (homozygous reference) sites (the math for previous versions was not correct).
• Biased downsampling (i.e. contamination removal) values can now be specified as per-sample fractions.
• Fixed bug where biased downsampling (i.e. contamination removal) was not being performed correctly in the presence of reduced reads.
• The indel likelihoods calculation had several bugs (e.g. sometimes the log likelihoods were positive!) that manifested themselves in certain situations and these have all been fixed.
• Small run time improvements were added.

Haplotype Caller

• Extensive performance improvements were added to the Haplotype Caller. This includes run time enhancements (it is now much faster than previous versions) plus improvements in accuracy for both SNPs and indels. Internal assessment now shows the Haplotype Caller calling variants more accurately than the Unified Genotyper. The changes for this tool are so extensive that they cannot easily be enumerated in these notes.

Variant Annotator

• The QD annotation is now divided by the average length of the alternate allele (weighted by the allele count); this does not affect SNPs but makes the calculation for indels much more accurate.
• Fixed Fisher Strand annotation where p-values sometimes summed to slightly greater than 1.0.
• Fixed Fisher Strand annotation for indels where reduced reads were not being handled correctly.
• The Haplotype Score annotation no longer applies to indels.
• Added the Variant Type annotation (not enabled by default) to annotate the VCF record with the variant type.
• The DepthOfCoverage annotation has been renamed to Coverage.
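
The QD change described in the list above can be written out as arithmetic. The sketch below is a hedged illustration, not the GATK code: it assumes QD starts as QUAL divided by depth, and it takes the per-allele lengths as inputs because how GATK measures allele length for indels is not stated here. The function name `qual_by_depth` and its argument layout are hypothetical.

```python
def qual_by_depth(qual, depth, alt_alleles):
    """Length-normalized QD.

    alt_alleles is a list of (allele_length, allele_count) pairs.
    The mean length is weighted by allele count, as in the release note.
    """
    total_count = sum(count for _, count in alt_alleles)
    if depth <= 0 or total_count <= 0:
        return None  # nothing to normalize by
    mean_length = sum(length * count for length, count in alt_alleles) / total_count
    return qual / (depth * mean_length)


# A SNP (length 1) is unaffected by the extra division.
print(qual_by_depth(300.0, 30, [(1, 2)]))
# Two ALT alleles of lengths 3 and 1, one copy each: mean length is 2.
print(qual_by_depth(300.0, 30, [(3, 1), (1, 1)]))
```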

Reduce Reads

• Several small run time improvements were added to make this tool slightly faster.
• By default this tool now uses a downsampling value of 40x per start position.

Indel Realigner

• Fixed bug where some reads with soft clipped bases were not being realigned.

Combine Variants

• Run time performance improvements added where one uses the PRIORITIZE or REQUIRE_UNIQUE options.

Select Variants

• The --regenotype functionality has been removed from SelectVariants and transferred into its own tool: RegenotypeVariants.

Variant Eval

• Removed the GenotypeConcordance evaluation module (which had many bugs) and converted it into its own tested, standalone tool (called GenotypeConcordance).

Miscellaneous

• The VariantContext and related classes have been moved out of the GATK codebase and into Picard's public repository. The GATK now uses the variant.jar as an external library.
• Added a new Read Filter to reassign just a particular mapping quality to another one (see the ReassignOneMappingQualityFilter).
• Added the Regenotype Variants tool that allows one to regenotype a VCF file (which must contain likelihoods in the PL field) after samples have been added/removed.
• Added the Genotype Concordance tool that calculates the concordance of one VCF file against another.
• Bug fix for VariantsToVCF for records where old dbSNP files had '-' as the reference base.
• The GATK now automatically converts IUPAC bases in the reference to Ns and errors out on other non-standard characters.
• Fixed bug for the DepthOfCoverage tool which was not counting deletions correctly.
• Added Cat Variants, a standalone tool to quickly combine multiple VCF files whose records are non-overlapping (e.g. as produced during scatter-gather).
• The Somatic Indel Detector has been removed from our codebase and moved to the Broad Cancer group's private repository.
• Fixed Validate Variants rsID checking which wasn't working if there were multiple IDs.
• Picard jar updated to version 1.84.1337.
• Tribble jar updated to version 1.84.1337.
• Variant jar updated to version 1.85.1357.
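
The reference-base handling described in the list above (IUPAC codes become Ns, anything else is an error) can be sketched as follows. This is an illustration only: the function name `normalize_reference_base` is hypothetical, and the treatment of lower-case bases is an assumption.

```python
STANDARD_BASES = set("ACGTN")
# IUPAC nucleotide ambiguity codes other than N.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHV")


def normalize_reference_base(base: str) -> str:
    """Map one reference character to what the engine would work with."""
    upper = base.upper()  # assumption: case does not matter for the check
    if upper in STANDARD_BASES:
        return base
    if upper in IUPAC_AMBIGUITY_CODES:
        return "N"
    raise ValueError(f"Non-standard reference character: {base!r}")


print("".join(normalize_reference_base(b) for b in "ACRTY"))
```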

Created 2012-12-17 14:56:06 | Updated 2012-12-18 20:21:23 | Tags: official release-notes

GATK 2.3 was released on December 17, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

• Soft clipped bases are no longer counted in the delocalized BQSR.
• The user can now set the maximum allowable cycle with the --maximum_cycle_value argument.

Unified Genotyper

• Minor (5%) run time improvements to the Unified Genotyper.
• Fixed bug for the indel model that occurred when long reads (e.g. Sanger) in a pileup led to a read starting after the haplotype.
• Fixed bug in the exact AF calculation where log10pNonRefByAllele should really be log10pRefByAllele.

Haplotype Caller

• Fixed the performance of GENOTYPE_GIVEN_ALLELES mode, which often produced incorrect output when passed complex events.
• Fixed the interaction with the allele biased downsampling (for contamination removal) so that the removed reads are not used for downstream annotations.
• Implemented minor (5-10%) run time improvements to the Haplotype Caller.
• Fixed the logic for determining active regions, which was a bit broken when intervals were used in the system.

Variant Annotator

• The FisherStrand annotation ignores reduced reads (because they are always on the forward strand).
• Can now be run multi-threaded with -nt argument.

Reduce Reads

• Fixed bug where sometimes the start position of a reduced read was less than 1.
• ReduceReads now co-reduces bams if they're passed in together with multiple -I.

Combine Variants

• Fixed the case where the PRIORITIZE option is used but no priority list is given.

Phase By Transmission

• Fixed bug where the AD wasn't being printed correctly in the MV output file.

Miscellaneous

• A brand new version of the per site down-sampling functionality has been implemented that works much, much better than the previous version.
• More efficient initial file seeking at the beginning of the GATK traversal.
• Fixed the compression of VCF.gz where the output was too big because of unnecessary call to flush().
• The allele biased downsampling (for contamination removal) has been rewritten to be smarter; also, it no longer aborts if there's a reduced read in the pileup.
• Added a major performance improvement to the GATK engine that stemmed from a problem with the NanoSchedule timing code.
• Added checking in the GATK for mis-encoded quality scores.
• Fixed downsampling in the ReadBackedPileup class.
• Fixed the parsing of genome locations that contain colons in the contig names (which is allowed by the spec).
• Made ID an allowable INFO field key in our VCF parsing.
• Multi-threaded VCF to BCF writing no longer produces an invalid intermediate file that fails on merging.
• Picard jar remains at version 1.67.1197.
• Tribble jar updated to version 119.
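
Parsing genome locations whose contig names contain colons, as mentioned in the list above, is ambiguous unless the parser knows the valid contig names. The sketch below shows one way to resolve it, assuming a `contig:start-stop` syntax and a set of known contigs; it is not the GATK parser, and the function name `parse_genome_loc` is hypothetical.

```python
def parse_genome_loc(text, known_contigs):
    """Split 'contig[:start[-stop]]' into (contig, start, stop).

    The whole string is tried as a contig name first, so a colon inside
    the name is not mistaken for the coordinate separator.
    """
    if text in known_contigs:
        return text, None, None
    # Otherwise only the last colon can separate contig from coordinates.
    contig, separator, coordinates = text.rpartition(":")
    if not separator or contig not in known_contigs:
        raise ValueError(f"Unknown contig in location: {text!r}")
    start_text, dash, stop_text = coordinates.partition("-")
    start = int(start_text)
    stop = int(stop_text) if dash else start
    return contig, start, stop


contigs = {"chr1", "HLA-A*01:01"}
print(parse_genome_loc("HLA-A*01:01:100-200", contigs))
print(parse_genome_loc("chr1:5", contigs))
```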

GATK 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

• Improved the algorithm around homopolymer runs to use a "delocalized context".
• Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
• Fixed bug where the tool failed for reads that begin with insertions.
• Fixed bug in the scatter-gather functionality.
• Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
• Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
• Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
• Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
• Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
• The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
• Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
• Generalized ploidy model now handles reference calls correctly.
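
The FisherStrand change described in the list above (compute the p-value with and without low-quality bases, keep the larger) can be sketched as follows. This is a minimal illustration, not the GATK code: it uses a textbook two-sided Fisher's exact test on a 2x2 strand table, and the function names and the (ref_fwd, ref_rev, alt_fwd, alt_rev) table layout are assumptions.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total
    if n == 0:
        return 1.0  # no reads, so no evidence of strand bias

    denominator = comb(n, fwd_total)

    def table_probability(x):
        # Hypergeometric probability of x forward-strand REF reads
        # given the fixed row and column totals.
        return comb(ref_total, x) * comb(alt_total, fwd_total - x) / denominator

    observed = table_probability(ref_fwd)
    lowest = max(0, fwd_total - alt_total)
    highest = min(ref_total, fwd_total)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def least_significant_p(all_bases_table, filtered_table):
    """Compute both p-values and keep the maximum (least significant) one."""
    return max(
        fisher_exact_two_sided(*all_bases_table),
        fisher_exact_two_sided(*filtered_table),
    )


# Strong apparent bias with all bases, none after filtering.
print(least_significant_p((3, 0, 0, 3), (5, 5, 5, 5)))
```

Taking the maximum is the conservative choice: a site is flagged for strand bias only if the bias shows up both with and without the low-quality bases.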

Haplotype Caller

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

• Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

• Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
• Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
• Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

• Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

• AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Now outputs the first non-missing QUAL, not the maximum.

Select Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Removed the -number argument because it gave biased results.

Validate Variants

• Added option to selectively choose particular strict validation options.
• Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
• Improved the error message around unused ALT alleles.

Somatic Indel Detector

• Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

• Fixed raw HapMap file conversion bug in VariantsToVCF.
• Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
• Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
• Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
• Fixed bug in BCF2 writer for case where all genotypes are missing.
• Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
• Fixed bug in Phase By Transmission when there are no likelihoods present.
• Fixed bug in fasta .fai generation.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 | Tags: unifiedgenotyper official baserecalibrator combinevariants haplotypecaller selectvariants varianteval release-notes

Base Quality Score Recalibration

• Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
• Implemented support for SOLiD no call strategies other than throwing an exception.
• Fixed smoothing in the BQSR bins.
• Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

• Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
• UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
• Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
• In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
• Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

• Added LowQual filter to the output when appropriate.
• Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
• Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
• Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
• Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
• Fixed bug where non-standard bases from the reference would cause errors.
• Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

• Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
• Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
• Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

• Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
• Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

• Now outputs the first non-MISSING QUAL, instead of the maximum.
• Now supports multi-threaded running (with the -nt argument).

Select Variants

• Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
• No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
• If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

Miscellaneous

• GATK now generates a proper error when a gzipped FASTA is passed in.
• Various improvements throughout the BCF2-related code.
• Removed various parallelism bottlenecks in the GATK.
• Added support of X and = CIGAR operators to the GATK.
• Catch NumberFormatExceptions when parsing the VCF POS field.
• Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
• Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
• We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
• Added support for handling complex events in ValidateVariants.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2012-07-23 19:16:29 | Updated 2012-08-10 00:07:47 | Tags: official release-notes

The GATK 2.0 release includes both the addition of brand-new (and often still experimental) tools and updates to the existing stable tools.

New Tools

• Base Recalibrator (BQSR v2), an upgrade to CountCovariates/TableRecalibration that generates base substitution, insertion, and deletion error models.
• Reduce Reads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
• HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
• Plus powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.

Base Quality Score Recalibration

• IMPORTANT: the Count Covariates and Table Recalibration tools (which comprise BQSRv1) have been retired! Please see the BaseRecalibrator tool (BQSRv2) for running recalibration with GATK 2.0.

Unified Genotyper

• Handle exception generated when non-standard reference bases are present in the fasta.
• Bug fix for indels: when checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.
• Now emits the MLE AC and AF in the INFO field.
• Don't allow N's in insertions when discovering indels.

Phase By Transmission

• Multi-allelic sites are now correctly ignored.
• Reporting of mendelian violations is enhanced.
• Corrected TP overflow.
• Fixed bug that arose when no PLs were present.
• Added option to output the father's allele first in phased child haplotypes.
• Fixed a bug that caused the wrong phasing of child/father pairs.

Variant Eval

• Improvements to the validation report module: if eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.
• If present, the AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC).
• Fixed bugs in the VariantType and IndelSize stratifications.

Variant Annotator

• FisherStrand annotation no longer hard-codes in filters for bases/reads (previously used MAPQ > 20 && QUAL > 20).
• Miscellaneous bug fixes to experimental annotations.
• Added a Clipping Rank Sum Test to detect when variants are present on reads with differential clipping.
• Fixed the ReadPos Rank Sum Test annotation so that it no longer uses the un-hardclipped start as the alignment start.
• Fixed bug in the NBaseCount annotation module.
• The new TandemRepeatAnnotator is now a standard annotation while HRun has been retired.
• Added PED support for the Inbreeding Coefficient annotation.
• Don't compute QD if there is no QUAL.

Variant Quality Score Recalibration

• The VCF index is now created automatically for the recalFile.

Variant Filtration

• Now allows you to run with type-unsafe JEXL selects, which all default to false when matching.

Select Variants

• Added an option which allows the user to re-genotype through the exact AF calculation model (if PLs are present) in order to recalculate the QUAL and genotypes.

Combine Variants

• Added --mergeInfoWithMaxAC argument to keep info fields from the input with the highest AC value.

Somatic Indel Detector

• GT header line is now output.

Indel Realigner

• Automatically skips Ion reads just like it does with 454 reads.

Variants To Table

• Genotype-level fields can now be specified.
• Added the --moltenize argument to produce molten output of the data.

Depth Of Coverage

• Fixed a NullPointerException that could occur if the user requested an interval summary but never provided a -L argument.

Miscellaneous

• BCF2 support in tools that output VCFs (use the .bcf extension).
• The GATK Engine no longer automatically strips the suffix "Walker" after the end of tool names; as such, all tools whose name ended with "Walker" have been renamed without that suffix.
• Fixed bug when specifying a JEXL expression for a field that doesn't exist: we now treat the whole expression as false (whereas we were rethrowing the JEXL exception previously).
• There is now a global --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends).
• Removed all code associated with extended events.
• Algorithmically faster version of DiffEngine.
• Better down-sampling fixes edge case conditions that used to be handled poorly. Read Walkers can now use down-sampling.
• GQ is now emitted as an int, not a float.
• Fixed bug in the Beagle codec that was skipping the first line of the file when decoding.
• Fixed bug in the VCF writer in the case where there are no genotypes for a record but there are genotypes in the header.
• Miscellaneous fixes to the VCF headers being produced.
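
The effect of the global --interval_padding argument mentioned in the list above can be sketched as simple arithmetic on each interval. This is an illustration under stated assumptions, not the engine code: coordinates are taken to be 1-based and inclusive, clamping to the contig bounds is assumed, and the function name `pad_interval` is hypothetical.

```python
def pad_interval(start, stop, padding, contig_length):
    """Extend an interval by `padding` bases on both ends.

    Assumes 1-based inclusive coordinates and clamps to [1, contig_length].
    """
    return max(1, start - padding), min(contig_length, stop + padding)


# 50 bp of padding on each side of 100-200.
print(pad_interval(100, 200, 50, 1000))
# Near the contig ends the padded interval is clamped.
print(pad_interval(10, 990, 50, 1000))
```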