The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.
GATK variant manipulation tools are able to recognize the following types of alleles:
Of our two variant callers, UnifiedGenotyper is the more limited, as it only calls SNPs and indels, and does so separately (even if you run in calling mode BOTH, the program performs separate calling operations internally). The HaplotypeCaller is more sophisticated and calls different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default. It is also capable of emitting MNPs and symbolic alleles, but the modes to do so are not enabled by default and they are not part of our recommended best practices for the tool.
The GATK currently does not handle SVs (structural variations) or CNVs (copy number variations), but there are some third-party software packages built on top of GATK that provide this functionality. See GenomeSTRiP for SVs and XHMM for CNVs.
we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case. In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV. One especially strange case is the following: For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable. Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers.
I have WES data of 235 samples, aligned using bwa aln. I followed GATK's Best Pratice for marking duplicates, realignment around INDEL, and BQSR until I got the recalibrated bamfiles. All these are done with GATK 2.7-2. After that I generated gvcf file from the recalibrated bamfiles, using HaplotypeCaller from GATK 3.1-1, followed by GenotypeGVCFs and VQSR.
I came across one INDEL, which passes VQSR,
The variant is chr14 92537378 . G GC 11330.02 PASS AC=11;AF=0.023;AN=470;BaseQRankSum=0.991;ClippingRankSum=-6.820e-01;DP= 13932;FS=0.000;GQ_MEAN=137.20;GQ_STDDEV=214.47;InbreedingCoeff=-0.0311;MLEAC=11;MLEAF=0.023;MQ=43.67;MQ0=0;MQRankSum=-1.073e+00;NCC=0;QD=18.16; ReadPosRankSum=-1.760e-01;VQSLOD=3.70;culprit=FS
Indivudual genotypes seem very good. To name a few: 1800 0/0:65,0:65:99:0,108,1620 0/0:86,0:86:99:0,120,1800 0/1:23,25:48:99:1073,0,1971 0/1:51,39:90:99:1505,0,4106
But when I checked the variants in IGV, something strange popped out.
The pictures (and complete post) can be view here: http://alittlebitofsilence.blogspot.hk/2014/09/to-be-solved-gatk-haplotypecaller-indel.html
Judging from the alignment visualized by IGV, there are no insertions at that site, but an SNP in the next position, in all samples. But HaplotypeCaller calls G->GC insertion in some samples while not in other samples. My questions are: 1) In variant calling HaplotypeCaller would do a local de-novo assembly in some region. Is the inconsistency between bamfiles and gvcf because of the re-assembly? 2) Why is the insertion called in some samples but not others, although the position in all samples looks similar? 3) There are other variants called adjacent or near this site, but later filtered by VQSR. Does that indicate the variants from this region cannot be trusted? chr14 92537353 . C CG 303342.41 VQSRTrancheINDEL0.00to90.00+ chr14 92537354 . C CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG 142809.26 VQSRTrancheINDEL0.00to90.00+ chr14 92537378 . G GC 11330.02 PASS chr14 92537379 rs12896583 T TGCTGCTGCTGCTGC,C 13606.25 VQSRTrancheINDEL0.00to90.00+ chr14 92537385 rs141993435 CTTT C 314.64 VQSRTrancheINDEL0.00to90.00+ chr14 92537387 rs12896588 T G 4301.60 VQSRTrancheSNP99.90to100.00 chr14 92537388 rs12896589 T C 4300.82 VQSRTrancheSNP99.90to100.00 chr14 92537396 . G GC,GCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC 4213.61 VQSRTrancheINDEL0.00to90.00+
chr14 92537397 . T TGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,TGCTGCTGCTGCTG,TGCTGCTGCTG,TGCTGCTG 2690.79 VQSRTrancheINDEL0.00to90.00+ 3) I used GATK 2.7 for processing before calling gvcf files. Any possibility that if I change to 3.2, the bamfiles would look more similar to gvcf result? I have tried GATK3.2-2 for generating gvcf from GATK 2.7 bamfiles, the results are the same.
According to the definition of QD, it equals to the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
However, when I ran the Unified Genotyper (v2.8.1) on a pair of samples for indel calls, the QD score for some indels are much lower than expected. For example,
1) chrX 51075762 . GCAGCTGCGT G 142.15 . AC=1;AF=0.250;AN=4;BaseCounts=0,0,204,0;BaseQRankSum=-1.435;DP=204;FS=0.000;GC=70.82;LowMQ=0.0049,0.0049,204;MLEAC=1;MLEAF=0.250;MQ=55.32;MQ0=0;MQRankSum=-2.698;QD=0.15;ReadPosRankSum=-3.784;Samples=tumor_DNA_2C;VariantType=DELETION.NumRepetitions_1.EventLength_9 GT:AD:DP:GQ:PL 0/0:94,0:95:99:0,283,11503 0/1:101,7:108:99:181,0,11532
in this case, the expected QD = 142.15 / 108 = 1.32, but the reported QD = 0.15
2) chr21 45649571 . C CCTGGACCCGCCCTGGACACCCCACGGGGG 526.15 . AC=1;AF=0.250;AN=4;BaseCounts=0,78,0,0;BaseQRankSum=0.787;DP=78;FS=0.000;GC=65.84;LowMQ=0.0000,0.0000,78;MLEAC=1;MLEAF=0.250;MQ=60.33;MQ0=0;MQRankSum=1.453;QD=0.44;ReadPosRankSum=-3.875;Samples=tumor_DNA_2C;VariantType=INSERTION.NOVEL_10orMore GT:AD:DP:GQ:PL 0/0:31,1:37:86:0,86,3672 0/1:30,6:41:99:565,0,4204
the expected QD = 526.15 / 41 = 12.83, the reported QD 0.44 is much smaller.
Do you know what's happening? Should I filter those indel calls (since QD < 2.0)?
first of all thanks for the support offered here. But (sadly) I have an other two questions.
I used the FastaAlternateReferenceMaker to create a consesus sequence and it worked great. Here I have a question regarding a warning: "IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation". I do not get this warning. Which dictionary is needed here. Is this warning comming up, because I used .gz and .tbi compressed vcf-files?
And a furhter question regarding the creation of a consesus sequence. I`m unsure what to do in cases of heterozygous INDELs. What would you suggest to do? Skip them as they are only represented by one allele or include them in the consensus. Both seems to be incorrect. Do you know the general handling here?
I have narrowed down this issue to UG Indel caller. I am currently running UG will an additional 136 1000Genome individuals to be call at the same time my variants of interest. I run my SNP and Indel calls separate as per GATK best practices. But have noticed this issue only with my Indel calls.
This is the command I'm running:
java -Xmx32g -jar -Djava.io.tmpdir=/tmp/ /home/srynearson/GenomeAnalysisTK-nightly-oct/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/srynearson/cAPTUrE/references/human_g1k_v37.fasta -I [each of the 1000Genome .bam files] -I [each of my variants of interest .bam files] --num_threads 15 --standard_min_confidence_threshold_for_calling 30.0 --standard_min_confidence_threshold_for_emitting 30.0 --output_mode EMIT_VARIANTS_ONLY --num_cpu_threads_per_data_thread 2 --genotype_likelihoods_model INDEL
And this is the error I get:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Alignment 88304365 | 67S24M1D8M1S at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:574) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:438) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(ReadUtils.java:434) at org.broadinstitute.sting.gatk.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:76) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:200) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromLikelihoodMap(RankSumTest.java:179) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:102) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)
I have confirmed prior files and the SNP file runs to completion.
Any ideas? Thanks!
I have Ion Torrent data. I am trying to call a variant that I know to exist (confirmed with Sanger). In the position where there is the known indel, I have a depth of roughly 80-90 (in two different runs) and a of those between 20-23% of the reads have the insertion called. What parameters should I be adjusting to get this indel to call? I don't mind a large number of false positives.
I've tried several iterations that include indel realignment using known indels (1000G_phase1 and ills_and_1000G_gold_standard) and also excluding them. I have also tried iterations of setting these flags in UnifiedGenotyper:-stand_call_conf 30.0 -stand_emit_conf 0.0 --min_base_quality_score 0 -glm BOTH --dbsnp dbsnp_137.b37.vcf -nt -rf BadCigar -minIndelCnt 3 -minIndelFrac 0.15. I have also attempted to use HaplotypeCaller: -stand_call_conf 30.0 -stand_emit_conf 0.0 --dbsnp dbsnp_137.b37.vcf -rf BadCigar
Any suggestions would be great.
I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.
I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding, or something about using EMIT_ALL_SITES...
The excerpt below was generated using
GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2
CH1 225 . T G 12.71 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188 GT:AD:DP:GQ:PL 0/1:41,16:55:20:20,0,1435 CH1 226 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 227 . A . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 228 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 229 . A . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38 CH1 230 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38 CH1 231 . T . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36 CH1 232 . G . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36 CH1 233 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:37 CH1 234 . A . 139.53 . AN=2;DP=70;MQ=59.20;MQ0=14 GT:DP 0/0:63 CH1 235 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79 CH1 236 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79 CH1 237 . T . 175.53 . AN=2;DP=85;MQ=51.37;MQ0=16 GT:DP 0/0:80 CH1 238 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:97 CH1 238 . A AGAAAGAAAGCTTGTA 83.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.172;DP=102;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=46.90;MQ0=0;MQRankSum=-6.190;QD=0.05;ReadPosRankSum=-5.733 GT:AD:DP:GQ:PL 0/1:27,25:57:99:121,0,4853 CH1 239 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:101 CH1 240 . T . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:98 CH1 241 . A . 169.53 . AN=2;DP=108;MQ=44.14;MQ0=29 GT:DP 0/0:107 * CH1 242 . T . 169.53 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP 0/0:103 * CH1 242 . T . 118.27 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP 0/0:27:55 * CH1 243 . C . 172.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:108 * CH1 243 . CTTTT . 118.27 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:AD:DP 0/0:27:56 CH1 244 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:61 CH1 245 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:53 CH1 246 . T . 73.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:41 CH1 247 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:46 CH1 248 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100 CH1 249 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100 CH1 250 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:101 CH1 251 . T . 169.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:96 CH1 251 . T . 118.27 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:AD:DP 0/0:27:56 CH1 252 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:113 CH1 253 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:110 CH1 254 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111 CH1 255 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111 CH1 256 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111
Line 1 is a SNP
Lines 14 and 15 are an indel that I do understand
Lines 19 and 20 I do /not/ understand
Lines 21 and 22 I do /not/ understand
Just thought I should report a possible small bug with SelectVariant --maxIndelSize as I think it's only filtering insertions greater than the given value and not deletions. Of course using JEXL expressions I can still get the variants I'm after...