I'm using GATK version 3.1-1 and following Best Practices to call SNPs and INDELs for my exome sequencing data. I noticed that the AD field didn't update properly after I merged per-sample GVCF files into the per-cohort VCF file.
As an example, one SNP on mitochondrion at location 185 has following entries in per-sample GVCF files:
chrM 185 . G T,
After merging with GenotypeGVCFs, the same SNP has the entry in the per-cohort VCF file: chrM 185 . G T,A 3561.78 . AC=8,4;AF=0.098,0.049;AN=82;DP=565;FS=0.000;InbreedingCoeff=0.8981;MLEAC=8,3;MLEAF=0.098,0.037;MQ=52.36;MQ0=0;QD=32.42 GT:AD:DP:GQ:PL 0/0:.:13:30:0,30,450,30,450,450 0/0:.:16:39:0,39,585,39,585,585 1/1:0,23,0:23:93:1288,93,0,1288,93,1288 0/0:.:8:21:0,21,315,21,315,315 0/0:.:14:27:0,27,405,27,405,405 0/0:.:11:24:0,24,360,24,360,360 0/0:.:2:6:0,6,75,6,75,75 0/0:.:9:18:0,18,270,18,270,270 0/0:.:8:21:0,21,269,21,269,269 0/0:.:18:36:0,36,540,36,540,540 0/0:.:10:21:0,21,315,21,315,315 0/0:.:23:60:0,60,900,60,900,900 0/0:.:45:99:0,102,1530,102,1530,1530 0/0:.:50:99:0,120,1800,120,1800,1800 0/0:.:8:21:0,21,266,21,266,266 0/0:.:16:36:0,36,540,36,540,540 1/1:0,7,0:7:33:444,33,0,444,33,444 0/0:.:5:6:0,6,90,6,90,90 0/0:.:4:9:0,9,135,9,135,135 0/0:.:7:21:0,21,249,21,249,249 0/0:.:12:27:0,27,405,27,405,405 0/0:.:23:60:0,60,900,60,900,900 2/2:0,5,0:5:21:293,293,293,21,21,0 0/0:.:8:21:0,21,278,21,278,278 0/0:.:11:24:0,24,360,24,360,360 0/0:.:8:24:0,24,292,24,292,292 2/2:0,3,0:3:18:239,239,239,18,18,0 0/0:.:12:27:0,27,405,27,405,405 0/0:.:12:21:0,21,315,21,315,315 1/1:0,21,0:21:99:1350,99,0,1350,99,1350 0/0:.:13:27:0,27,405,27,405,405 0/0:.:3:9:0,9,112,9,112,112 1/1:0,3,0:3:15:204,15,0,204,15,204 0/0:.:7:21:0,21,209,21,209,209 0/0:.:13:27:0,27,405,27,405,405 0/0:.:5:6:0,6,90,6,90,90 0/0:.:7:21:0,21,260,21,260,260 0/0:.:10:24:0,24,360,24,360,360 0/0:.:40:93:0,93,1395,93,1395,1395 0/0:.:38:84:0,84,1260,84,1260,1260 0/0:.:14:33:0,33,495,33,495,495
Both alternative alleles were included in the ALT field and the genotype field (GT) was updated correspondingly. For genotype 2/2, I think the correct AD field should be 0,0,5 and 0,0,3 in my aforementioned example. However, these fields remained the same as in the original per-sample GVCF files (i.e., 0,5,0 and 0,3,0). I'm wondering whether my understanding about the AD field is correct or not. And also, if it is an erroneous reporting, whether it'll influence the downstream VQSR and other filtering/annotations or not.
I am trying to filter some of my high-coverage samples based on a minimum depth and have found that the value stored in the DP INFO field and the AD genotype tag changes depending on whether or not I have run VariantAnnotator. The call I have used for VariantAnnotator is:
java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ucsc.hg19.fasta -I example.bam --variant example.raw.vcf --out example.annotated.vcf -G StandardAnnotation -L example.raw.vcf -rf BadCigar -dcov 15000
Here are the differences for some test cases with HaplotypeCaller:
No MarkDuplicates, did IndelRealigner & BQSR, nightly build 12/04/2013
Annotated: DP=2745, AD=4,2729 Raw: DP=957, AD=1,907
MarkDuplicates, IndelRealigner and BQSR, nightly build 12/04/2013
Annotated: DP=20, AD=0,20 Raw: DP=10, AD=0,8
Raw BAM, nightly build 12/04/2013
Annotated: DP=2745,AD=4,2729 Raw: DP=868, AD=1,864
Raw BAM, version 2.4-9
Annotated: DP=2745, AD=4,2729 Raw: DP=616, AD=1,611
I suspect what is happening here is that VariantAnnotator is taking the depth information from the provided BAM and replacing the depth information reported by the variant caller. Anyway, just wondering- which value is a better reflection of the depth used to make a given variant call? (i.e. which could I use in hard filtering?)
Thanks for your help!
Calling SNPs using a single bam file with the command:
java -Xmx30g -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R ref.fasta \ -I input.bam \ -o output.vcf \
and when looking at the output file, most DP values were equal to the AD values and in few cases the AD value was higher. Thought that AD values are the unfiltered counts of all reads and DP fields describes the total depth of reads that passed the Unified genotyper’s internal quality control. Is it normal for the AD values to be higher than the DP value?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT eo227 gi|218430358|emb|CU928163.2| 2317180 . T G 76.55 . AC=1;AF=0.50;AN=2;BaseQRankSum=1.568;DP=10;Dels=0.00;FS=11.181;HRun=0;HaplotypeScore=15.8585;MQ=28.63;MQ0=0;MQRankSum=1.036;QD=7.66;ReadPosRankSum=-0.633;SB=-0.01 GT:AD:DP:GQ:PL 0/1:6,4:10:99:107,0,154 gi|218430358|emb|CU928163.2| 2317181 . T G 71.96 . AC=1;AF=0.50;AN=2;BaseQRankSum=0.550;DP=10;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=19.8574;MQ=28.63;MQ0=0;MQRankSum=-1.754;QD=7.20;ReadPosRankSum=-1.754;SB=-0.01 GT:AD:DP:GQ:PL 0/1:3,4:10:87.90:102,0,88