The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:
These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.
Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).
In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.
Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.
GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.
Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.
--mergeVariantsViaLDargument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
allowNonUniqueKmersInRefso that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
.g.vcffile extension. See Highlights for more details.
-uniquifySamplesto GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
-dcovsetting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way
-Lworks in the GATK for VCF records with the END tag). See Highlights for more details.
-trimargument to trim (simplify) alleles to a minimal representation.
-trimAlternatesargument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
-noTrimargument to preserve original alleles.
-fixNDNflag fully functional.
-SMAis specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.
-drfargument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).
-qsub-broadargument. When -qsub-broad is specified instead of
-qsub, Queue will use the
h_vmemparameter instead of
h_rssto specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
slf4j-log4j12version (contributed by user Biocyberman).
GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
--sample_nameargument. This is a shortcut for people who have multi-sample BAMs but would like to use
-ERC GVCFmode with a particular one of those samples.
--ignore_all_filtersoption. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.
--keepOriginalAC functionalityin SelectVariants to work for sites that lose alleles in the selection.
read_grouparguments no longer appear in the header.
--bcffor VCF files, and
--generate_md5for BAM files moved to the engine level.
We've made a few minor updates to the license text as follows:
None of these changes should have any effect on anyone's ability to use GATK.
The slides from today's webinar are available as of now in the GSA team Dropbox at this link, and will be on the documentation website shortly.
Our partners at Appistry are putting on another webinar next week, and this one's going to be pretty special in our view -- because we're going to be doing pretty much all the talking!
Titled "Speed, Cohorts, and RNAseq: An Insider Look into GATK 3" (see that link for the full program), this webinar will be all about the GATK 3 features, of course. And lest you think this is just another marketing pitch (no offense, marketing people), rest assured that we will be diving into the gory technical details of what happens under the hood. This is a great opportunity to get the inside scoop on how the new features (RNAseq, GVCF pipeline etc) work -- all the stuff that's fit to print, but that we haven't had time to write down in the docs yet. So don't miss it if that's the sort of thing that floats your boat! Or if you miss it, be sure to check out the recording afterward.
As usual the webinar is completely free and open to everyone (not just Appistry customers or prospective for-profit users). All you need to do is register now and tune in on Thursday 4/10.
Talk to you then!
Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).
Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with
bwa mem) is this:
"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"and we also filter out heterozygous positions using "isHet == 1".
--output_mode EMIT_ALL_SITES \and
--emitRefConfidence GVCF \
Does this sound like a reasonable thing to do? What options should I use in step 8 in order for
HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using
--output_mode EMIT_ALL_CONFIDENT_SITES \ and
--emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).
I'm having a problem using GATK's ApplyRecalibration tool using this command:
java -jar "GenomeAnalysisTK.jar" -T "ApplyRecalibration" -R human_g1k_v37.fasta -input fileContents.vcf --ts_filter_level "99.9" --mode "SNP" --recal_file dataset_1139.dat_0.recal --tranches_file dataset_1140.dat_0.tranches --out dataset_1142.dat_0.vcf
This is the error:
I tried to look for some solutions, I added a :NAME,VCF tag before the vcf file, I tried using another vcf file and I validated the vcf file I'm using, using GATK's ValidateVariants and it was validated with no problem, I tried different versions of GATK including the current version 3.3, and I tried compressing and indexing the vcf file unsing bgzip+tabix but this problem still presists.
The weird thing is that in the log info I get this message before the error: INFO 14:21:11,775 ArgumentTypeDescriptor - Dynamically determined type of fileContents.vcf to be VCF
Any suggestions are appreciated.
Thanks in advance.
Dear GATK team, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html “GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step and merges the records together in a sophisticated manner. At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods, re-genotype the newly merged record, and then re-annotate it. Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller, which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.”
The final sentence says that "accurate genotype likelihoods", how do you define "accurate"? May I ask if it is fully evaluated? If so, could you please tell me how you do the evaluation?
Thank you in advance.
I have a quick question about the results of an CombineGVCFs file while creating large background files. Prior to combining the files a region of the .gvcf file from HC looks like this.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473
13 19408518 . A <NON_REF> . . END=19409266 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
13 19409267 . T <NON_REF> . . END=19409323 GT:DP:GQ:MIN_DP:PL 0/0:4:12:2:0,6,60
13 19409324 . C <NON_REF> . . END=19409400 GT:DP:GQ:MIN_DP:PL 0/0:15:42:8:0,24,299
If you noticed the first individual looks like this SRR070473, with a 0/0:0:0:0:0,0,0 recorded, but after combining the file in batches of 200, the same information will be recorded as no call.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 SRR070477 SRR070505 SRR070516 SRR070517 SRR070772 SRR070779 SRR070796 SRR0
13 19408518 . A <NON_REF> . . END=19408519 GT:DP:GQ:MIN_DP:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
The issue I'm seeing is when you attempt to use GenotypeGVCFs I get the following an error similar to this if using the file containing no call notation.
##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample ERR031932 does not have likelihoods at position 1:10929
Quick issue. I'm currently running VariantRecalibrator using the following command.
Program Args: -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 10 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf -an QD -an:HaplotypeScore -an:MQRankSum -an:ReadPosRankSum -an:FS -input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf -recalFile /data2/srynearson/UGP06/fastq/UGP06_snp_recal -tranchesFile /data2/srynearson/UGP06/fastq/UGP06_snp_tranches -rscriptFile /data2/srynearson/UGP06/fastq/UGP06_snp_plots.R -mode SNP
And getting the following error:
ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).
This seems strange given the "-input /data2/srynearson/UGP06/fastq/UGP06_genotyped.vcf" file above has 100 CEU individuals + one target exome.
I will note that the genotyped.vcf file does only show ~280,000 variants.
I have tried increasing the minNumBadVariants and reducing maxGaussian, but the error either stays the same or changes to "Unable to retrieve result".
I have noticed the walker logs being written to stderr instead of stdout. Is this a new feature or a bug? If I could vote against it, then I would. Unless there is a plan in the future to have GATK adhere to the Unix philosophy like SAMtools currently does; i.e. to operate on piped streams.