# Tagged with #gatk3 1 documentation article | 5 announcements | 6 forum discussions

Created 2014-04-16 21:58:52 | Updated 2015-05-16 06:55:15 | Tags: best-practices gatk3

This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

#### Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed user-reported bug featuring "trio" family with two children, one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

### Read Filters

• Added erroneous CIGAR length to criteria for BadCigarFilter.
• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).
• Made various changes that lead to reduced build times.

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

## Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

## Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

## GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

## RNAseq analysis

• Bug fixes for working with unmapped reads.

## CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

## DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

## SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

## PrintReads

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

## Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

## Queue

• Fixed Queue bug with bad localhost addresses.
• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

## Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

## Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

Created 2014-09-02 04:41:02 | Updated | Tags: phone-home license gatk3

We've made a few minor updates to the license text as follows:

• Updated the major version number from GATK2 to GATK3;
• Updated the Broad address from 7 Cambridge Center to 415 Main Street (simple renaming, no move involved);
• Added an explicit mention of the Phone Home reporting feature (which you can read more about here);
• Clarified the terms relative to third-party contributions to protected code (section 2.1 - Grant).

None of these changes should have any effect on anyone's ability to use GATK.

Created 2014-04-10 17:39:08 | Updated | Tags: video presentations appistry webinar slides gatk3

The slides from today's webinar are available as of now in the GSA team Dropbox at this link, and will be on the documentation website shortly.

Created 2014-04-04 01:48:46 | Updated | Tags: appistry rnaseq webinar gvcf gatk3

Our partners at Appistry are putting on another webinar next week, and this one's going to be pretty special in our view -- because we're going to be doing pretty much all the talking!

Titled "Speed, Cohorts, and RNAseq: An Insider Look into GATK 3" (see that link for the full program), this webinar will be all about the GATK 3 features, of course. And lest you think this is just another marketing pitch (no offense, marketing people), rest assured that we will be diving into the gory technical details of what happens under the hood. This is a great opportunity to get the inside scoop on how the new features (RNAseq, GVCF pipeline etc) work -- all the stuff that's fit to print, but that we haven't had time to write down in the docs yet. So don't miss it if that's the sort of thing that floats your boat! Or if you miss it, be sure to check out the recording afterward.

As usual the webinar is completely free and open to everyone (not just Appistry customers or prospective for-profit users). All you need to do is register now and tune in on Thursday 4/10.

Talk to you then!

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2014-10-29 13:07:43 | Updated 2014-10-29 13:11:42 | Tags: applyrecalibration tribble variantrecalibration gatk3

Hi there,

I'm having a problem using GATK's ApplyRecalibration tool using this command:

java -jar "GenomeAnalysisTK.jar" -T "ApplyRecalibration" -R human_g1k_v37.fasta -input fileContents.vcf --ts_filter_level "99.9" --mode "SNP" --recal_file dataset_1139.dat_0.recal --tranches_file dataset_1140.dat_0.tranches --out dataset_1142.dat_0.vcf

This is the error:

##### ERROR ------------------------------------------------------------------------------------------

I tried to look for some solutions, I added a :NAME,VCF tag before the vcf file, I tried using another vcf file and I validated the vcf file I'm using, using GATK's ValidateVariants and it was validated with no problem, I tried different versions of GATK including the current version 3.3, and I tried compressing and indexing the vcf file unsing bgzip+tabix but this problem still presists.

The weird thing is that in the log info I get this message before the error: INFO 14:21:11,775 ArgumentTypeDescriptor - Dynamically determined type of fileContents.vcf to be VCF

Any suggestions are appreciated.

Thanks in advance.

Shazly

Created 2014-09-10 03:04:06 | Updated | Tags: genotypegvcfs gatk3

Dear GATK team, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html “GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step and merges the records together in a sophisticated manner. At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods, re-genotype the newly merged record, and then re-annotate it. Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller, which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.”

The final sentence says that "accurate genotype likelihoods", how do you define "accurate"? May I ask if it is fully evaluated? If so, could you please tell me how you do the evaluation?

Thank you in advance.

Created 2014-06-17 15:47:14 | Updated | Tags: genotypegvcfs combinegvcfs gatk3

Hello all,

I have a quick question about the results of an CombineGVCFs file while creating large background files. Prior to combining the files a region of the .gvcf file from HC looks like this.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 13 19408518 . A <NON_REF> . . END=19409266 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 13 19409267 . T <NON_REF> . . END=19409323 GT:DP:GQ:MIN_DP:PL 0/0:4:12:2:0,6,60 13 19409324 . C <NON_REF> . . END=19409400 GT:DP:GQ:MIN_DP:PL 0/0:15:42:8:0,24,299

If you noticed the first individual looks like this SRR070473, with a 0/0:0:0:0:0,0,0 recorded, but after combining the file in batches of 200, the same information will be recorded as no call.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 SRR070477 SRR070505 SRR070516 SRR070517 SRR070772 SRR070779 SRR070796 SRR0 13 19408518 . A <NON_REF> . . END=19408519 GT:DP:GQ:MIN_DP:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

The issue I'm seeing is when you attempt to use GenotypeGVCFs I get the following an error similar to this if using the file containing no call notation.

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample ERR031932 does not have likelihoods at position 1:10929

Created 2014-05-19 15:14:24 | Updated 2014-05-19 15:15:51 | Tags: variantrecalibrator gatk3

Hello all,

Quick issue. I'm currently running VariantRecalibrator using the following command.

Program Args: -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 10 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf -an QD -an:HaplotypeScore -an:MQRankSum -an:ReadPosRankSum -an:FS -input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf -recalFile /data2/srynearson/UGP06/fastq/UGP06_snp_recal -tranchesFile /data2/srynearson/UGP06/fastq/UGP06_snp_tranches -rscriptFile /data2/srynearson/UGP06/fastq/UGP06_snp_plots.R -mode SNP

And getting the following error:

ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).

This seems strange given the "-input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf" file above has 100 CEU individuals + one target exome.

I will note that the genotyped.vcf file does only show ~280,000 variants.

I have tried increasing the minNumBadVariants and reducing maxGaussian, but the error either stays the same or changes to "Unable to retrieve result".

Any ideas.

--SR

Created 2014-03-10 16:14:21 | Updated | Tags: stdout gatk3 stderr

I have noticed the walker logs being written to stderr instead of stdout. Is this a new feature or a bug? If I could vote against it, then I would. Unless there is a plan in the future to have GATK adhere to the Unix philosophy like SAMtools currently does; i.e. to operate on piped streams.