Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.
-ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.
Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.
As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.
This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the
-ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the
You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:
But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).
In a nutshell, phased records will look like this:
1 1372243 . T <NON_REF> . . END=1372267 <snip> <snip> 1 1372268 . G A,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip> 1 1372269 . G T,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip> 1 1372270 . C <NON_REF> . . END=1372299 <snip> <snip>
You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.
Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).
We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the
--recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.
Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.
GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
--sample_nameargument. This is a shortcut for people who have multi-sample BAMs but would like to use
-ERC GVCFmode with a particular one of those samples.
--ignore_all_filtersoption. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.
--keepOriginalAC functionalityin SelectVariants to work for sites that lose alleles in the selection.
read_grouparguments no longer appear in the header.
--bcffor VCF files, and
--generate_md5for BAM files moved to the engine level.
I would like to know where / how could we download old versions of GATK, and specifically 2.7.x and 2.8.x. I would like to fix a bug occuring on our galaxy instance for teaching galaxy, but I do not want to update to 3.3 (in order to keep maximum wrapper compatibility).
Are those still available somewhere ?
Thank you for your work,
When I run GATK with identical settings on some amplicon sequencing data from MiSeq (150KB region), I get different numbers of variants (approximately 10% difference), even after setting -dfrac to 1. What is the cause for this variation? How to make results reproducible?
Thank you so much!
#8.1.vcf and 8.2.vcf are the raw VCFs from two runs with filters applied java -Xmx4g -jar gatk.jar -T CombineVariants -R human_g1k_v37.fasta -o 8merge_union.vcf -V 8.1.vcf -V 8.2.vcf grep Intersection 8merge_union.vcf > 8merge_intersection.vcf grep -v Intersection 8merge_union.vcf > 8merge_NOintersection.vcf grep PASS 8merge_NOintersection.vcf > 8merge_NOintersection.PASS.vcf wc -l *vcf 711 8.1.vcf 642 8.2.vcf 308 8merge_intersection.vcf 86 8merge_NOintersection.PASS.vcf 462 8merge_NOintersection.vcf 770 8merge_union.vcf
Please find 8.1.vcf and 8.2.vcf in attachment.
##realign java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -o 8.intervals -nt 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T IndelRealigner -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -targetIntervals 8.intervals --out 8.realign.bam -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf --consensusDeterminationModel USE_READS -LOD 0.4 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 # ##base recal java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T BaseRecalibrator -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --default_platform ILLUMINA -knownSites ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -knownSites ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -nct 6 -o 8.recal_data -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 java -Xmx6g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T PrintReads -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -o 8.recal.bam -BQSR 8.recal_data -nct 1 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 #variant calling java -Xmx8g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -I 8.recal.bam -o 8.raw.vcf --dbsnp ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -nct 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 #variant filtering java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.indel.vcf -selectType INDEL -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.snp.vcf -selectType SNP -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 #use hard filtering due to small input file #SNP java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.snp.vcf --filterName QDFilter --filterExpression 'QD<2.0' --filterName FSFilter --filterExpression 'FS>60.0' --filterName MQFilter --filterExpression 'MQ<40.0' --filterName MaqQualRankSumFilter --filterExpression 'MappingQualityRankSum<-12.5' --filterName ReadPosFilter --filterExpression 'ReadPosRankSum<-8.0' -o 8.filter.snp.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 #indel java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.indel.vcf --clusterWindowSize 10 --filterExpression 'QD<2.0' --filterName QDFilter --filterExpression 'ReadPosRankSum<-20.0' --filterName ReadPosFilter --filterExpression 'FS>200.0' --filterName FSFilter -o 8.filter.indel.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1 java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T CombineVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.filter.snp.vcf --variant 8.filter.indel.vcf -o 8.filter.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1