Tagged with #gatk3 1 documentation article | 4 announcements | 6 forum discussions

This is a placeholder for a document in preparation.

In the meantime, for a quick introduction to the GATK and the purpose of the Best Practices, please see the intro talks from our latest workshop in the Presentations section.

Important notes on context and caveats

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent). In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the FAQ articles for help dealing with other experimental designs. Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

RNAseq analysis

• Bug fixes for working with unmapped reads.

CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

• Updated the major version number from GATK2 to GATK3;
• Updated the Broad address from 7 Cambridge Center to 415 Main Street (simple renaming, no move involved);
• Added an explicit mention of the Phone Home reporting feature (which you can read more about here);
• Clarified the terms relative to third-party contributions to protected code (section 2.1 - Grant).

None of these changes should have any effect on anyone's ability to use GATK.

The slides from today's webinar are available as of now in the GSA team Dropbox at this link, and will be on the documentation website shortly.

Our partners at Appistry are putting on another webinar next week, and this one's going to be pretty special in our view -- because we're going to be doing pretty much all the talking!

Titled "Speed, Cohorts, and RNAseq: An Insider Look into GATK 3" (see that link for the full program), this webinar will be all about the GATK 3 features, of course. And lest you think this is just another marketing pitch (no offense, marketing people), rest assured that we will be diving into the gory technical details of what happens under the hood. This is a great opportunity to get the inside scoop on how the new features (RNAseq, GVCF pipeline etc) work -- all the stuff that's fit to print, but that we haven't had time to write down in the docs yet. So don't miss it if that's the sort of thing that floats your boat! Or if you miss it, be sure to check out the recording afterward.

As usual the webinar is completely free and open to everyone (not just Appistry customers or prospective for-profit users). All you need to do is register now and tune in on Thursday 4/10.

Talk to you then!

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Hi there,

I'm having a problem using GATK's ApplyRecalibration tool using this command:

java -jar "GenomeAnalysisTK.jar" -T "ApplyRecalibration" -R human_g1k_v37.fasta -input fileContents.vcf --ts_filter_level "99.9" --mode "SNP" --recal_file dataset_1139.dat_0.recal --tranches_file dataset_1140.dat_0.tranches --out dataset_1142.dat_0.vcf

This is the error:

ERROR ------------------------------------------------------------------------------------------

I tried to look for some solutions, I added a :NAME,VCF tag before the vcf file, I tried using another vcf file and I validated the vcf file I'm using, using GATK's ValidateVariants and it was validated with no problem, I tried different versions of GATK including the current version 3.3, and I tried compressing and indexing the vcf file unsing bgzip+tabix but this problem still presists.

The weird thing is that in the log info I get this message before the error: INFO 14:21:11,775 ArgumentTypeDescriptor - Dynamically determined type of fileContents.vcf to be VCF

Any suggestions are appreciated.

Shazly

Dear GATK team, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html “GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step and merges the records together in a sophisticated manner. At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods, re-genotype the newly merged record, and then re-annotate it. Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller, which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.”

The final sentence says that "accurate genotype likelihoods", how do you define "accurate"? May I ask if it is fully evaluated? If so, could you please tell me how you do the evaluation?

Hello all,

I have a quick question about the results of an CombineGVCFs file while creating large background files. Prior to combining the files a region of the .gvcf file from HC looks like this.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 13 19408518 . A <NON_REF> . . END=19409266 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 13 19409267 . T <NON_REF> . . END=19409323 GT:DP:GQ:MIN_DP:PL 0/0:4:12:2:0,6,60 13 19409324 . C <NON_REF> . . END=19409400 GT:DP:GQ:MIN_DP:PL 0/0:15:42:8:0,24,299

If you noticed the first individual looks like this SRR070473, with a 0/0:0:0:0:0,0,0 recorded, but after combining the file in batches of 200, the same information will be recorded as no call.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 SRR070477 SRR070505 SRR070516 SRR070517 SRR070772 SRR070779 SRR070796 SRR0 13 19408518 . A <NON_REF> . . END=19408519 GT:DP:GQ:MIN_DP:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

The issue I'm seeing is when you attempt to use GenotypeGVCFs I get the following an error similar to this if using the file containing no call notation.

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample ERR031932 does not have likelihoods at position 1:10929

Hello all,

Quick issue. I'm currently running VariantRecalibrator using the following command.

Program Args: -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 10 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf -an QD -an:HaplotypeScore -an:MQRankSum -an:ReadPosRankSum -an:FS -input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf -recalFile /data2/srynearson/UGP06/fastq/UGP06_snp_recal -tranchesFile /data2/srynearson/UGP06/fastq/UGP06_snp_tranches -rscriptFile /data2/srynearson/UGP06/fastq/UGP06_snp_plots.R -mode SNP

And getting the following error:

ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).

This seems strange given the "-input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf" file above has 100 CEU individuals + one target exome.

I will note that the genotyped.vcf file does only show ~280,000 variants.

I have tried increasing the minNumBadVariants and reducing maxGaussian, but the error either stays the same or changes to "Unable to retrieve result".

Any ideas.

--SR

I have noticed the walker logs being written to stderr instead of stdout. Is this a new feature or a bug? If I could vote against it, then I would. Unless there is a plan in the future to have GATK adhere to the Unix philosophy like SAMtools currently does; i.e. to operate on piped streams.