I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)
Here is a line of this file:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495
In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false
I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?
Hi, GATK Team
I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:
6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214
This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:
6 16327909 . ATGCTGATGCTGC ATGC,CTGCTGATGCTGC,A 1390.70 . AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL
However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?
generated with gatk 2.8-1-g932cd3a
Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):
CHROM POS ID REF ALT FILTER QUAL ABHet ABHom AC AF AN BaseCounts BaseQRankSum DP Dels FS GC HRun HaplotypeScore LowMQ MLEAC MLEAF MQ MQ0 MQRankSum MeanDP MinDP OND PercentNBaseSolid QD ReadPosRankSum Samples Somatic VariantType cosmic.ID 1.AB 1.AD 1.DP 1.F 1.GQ 1.GT 1.MQ0 1.PL 1.Z 2.AB 2.AD 2.DP 2.F 2.GQ 2.GT 2.MQ0 2.PL 2.Z 3.AB 3.AD 3.DP 3.F 3.GQ 3.GT 3.MQ0 3.PL 3.Z 4.AB 4.AD 4.DP 4.F 4.GQ 4.GT 4.MQ0 4.PL 4.Z 5.AB 5.AD 5.DP 5.F 5.GQ 5.GT 5.MQ0 5.PL 5.Z 11 92616485 0 A C PASS 63.71 0.333 0.698 1 0.1 10 89,54,0,0 -5.631 143 0 49.552 71.29 2 4.4154 0.0000,0.0000,143 1 0.1 50.27 0 -1.645 28.6 16 0.242 0 2.36 2.125 R5_A3_1 NA SNP COSM467570 NA 24,9 33 0.2727272727 54 A/A 0 0,54,537 -1.3055824197 0.33 9,18 27 0.6666666667 96 A/C 0 96,0,178 0.8660254038 NA 21,11 32 0.34375 21 A/A 0 0,21,466 -0.8838834765 NA 12,4 16 0.25 27 A/A 0 0,27,272 -1 NA 23,12 35 0.3428571429 42 A/A 0 0,42,537 -0.9296696802
This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....