Tagged with #dp
1 documentation article | 0 announcements | 16 forum discussions



Created 2014-10-17 19:19:39 | Updated 2015-07-06 13:44:00 | Tags: coveragebysample depthofcoverage depthperallelebysample ad dp
Comments (0)

Overview

This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.

The metrics involved are the following:

  • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
  • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.


Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

  • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
  • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determines the genotype calls.


TO BE CONTINUED...

No posts found with the requested search criteria.

Created 2015-07-29 13:43:11 | Updated | Tags: dp call-snp eval
Comments (3)

In the following small table you can see the relationship of SNP-calls with coverage_depth (DP) for 34 samples. SNP calls extracted from Haplotype_caller module of GATK. As it can bee seen by increasing DP the number of SNP-calls to be decreased. However, my expectation was in opposite. Is there any explanation? Thanks

Call-SNP DP 39263 3 34950 4 21228 5 17878 6 13249 7 10913 8 8891 9 7484 10 6126 11 5165 12 4336 13 3807 14 3240 15 2975 16 2636 17 2347 18 2222 19 2046 20


Created 2015-05-22 08:00:24 | Updated 2015-05-22 08:01:11 | Tags: haplotypecaller dp
Comments (9)

Hi,

currently I'm running GATK version 3.3.0 and run in the problem, that quite often I don't get a value for DP after the HaplotypeCaller. There actually is nothing (just a dot) in the VCF file.

chr11 57569573 . A ACC 1640.8 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:57,16:.:99:0|1:57569569_CAGG_C:480,0,7262 chr11 57569575 . A AT 1686.05 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:60,16:.:99:0|1:57569569_CAGG_C:492,0,7092

What woks is to annotate afterwards using the VariantAnnotator. But in the documentation it states, that the VariantAnnotator defines the depth by the total amount of reads at that specific position. In contrast the DP after HaplotypeCaller should be the number of informative reads for that position. I can find that difference when I check for the position in GEMINI prior and after Annotation with Variant annotator.

Patient 1 (after reannotation using VariantAnnotator) Position depth variant 57563990 75 C/C 57569568 81 CAGG/C 57569572 78 A/ACC 57569574 81 A/AT

Patient 2 (after reannotation using VariantAnnotator) Postition depth variant 57563990 73 C/T 57569568 104 CAGG/CAGG 57569572 106 A/A 57569574 106 A/A

After HaplotypeCaller Postion patient1 patient2 ...(other patients) 57563990 48 -1 -1 -1 -1 -1 C/C C/T C/T C/T C/T C/T 57569568 -1 66 52 -1 56 60 CAGG/C CAGG/CAGG CAGG/CAGG CAGG/C CAGG/CAGG CAGG/CAGG 57569572 -1 101 88 -1 101 77 A/ACC A/A A/A A/ACC A/A A/A 57569574 -1 97 87 -1 94 78 A/AT A/A A/A A/AT A/A A/A 57576040 -1 -1 -1 -1 -1 8 ./. ./. ./. ./. ./. C/C 57578937 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. TTG/T 57578941 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/CTG 57578942 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/G

E.g. patient 2 has in the position 57569568 66 informative reads, but 104 reads in total. To check how valid a Variant is, I think it is way more interesting to know the number of informative reads.

In this specific case we checked for those mutation (57569572 and 57569574) via sanger and didn't find anything. If we would know that only 2 or 3 reads would be informative we might have gone to another variant instead.

Am I doing something wrong, or is there a bug in the software?

One example how I call the HaplotypeCaller.

java -Xmx10g -jar /data/ngs/bin/GATK/3.3-0/GenomeAnalysisTK.jar -l INFO -et NO_ET \ -T HaplotypeCaller \ -R /data/ngs/programs/bundle_2.8/ucsc.hg19.fasta \ -D /data/ngs/programs/bundle_2.8/dbsnp_138.hg19.vcf \ -L /data/ngs/exome_targets/SureSelect_V4_S03723314_Regions.bed \ -ERC GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -A Coverage -A AlleleBalance \ -A QualByDepth -A HaplotypeScore \ -nct 4 \ -I out/06-BQSR/21_383_1.bam \ -o out/08-variant-calling/21_383_1/21_383_1-raw.vcf \ -log out/08-variant-calling/21_383_1/21_383_1-raw_calling.log

Thanks in advance,

Anselm


Created 2015-03-09 16:48:43 | Updated | Tags: dp gt genotypegvcfs
Comments (7)

Hi,

Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0:.:.:.:.

which seem ok to me. Note that DP = 0.

but more rarely, I see a genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  0/0:0,0:.:3:.:.:0,3,45
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,0:.:3:1|1:32486973_A_AG:45,3,0

and this is more questionnable. Note that DP = "."

Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ?

thank you.


Created 2015-03-08 09:52:48 | Updated | Tags: haplotypecaller dp
Comments (12)

Hello GATK team,

  1. I've noticed a strange variant in the gvcf output of HC:

Raw vcf: 6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0

After GenortpeGVCFs: 6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0

The DP and AD are 0, but there is a variant - A. What do you think? Why does it happened?

  1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two. 3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38

Maya


Created 2015-02-12 13:25:05 | Updated | Tags: snp dp indel allele-depth
Comments (15)

Hey,

we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case. In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV. One especially strange case is the following: For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable. Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers.

Sarah


Created 2015-02-02 14:33:08 | Updated | Tags: dp qual qd
Comments (10)

Hello Forum Members and GATK Team, I am trying to understand the calculations that goes into QD in GATK.

In the following its mentioned that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862

In joint VCF file I have following entry: 1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry: 1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?

Above is simple example where only in one sample a GT was called that was different from other 0/0.

Lets take another example (bit complicated): Joint VCF file entry: 1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entires in the GVCF files: Sample1 1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1 Sample2 1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1 Sample3 1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1 Sample4 1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0 Sample5 1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0 Sample6 1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0

Sample7 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample8 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample9 1 24089531 . A G, 0 . BaseQRankSum=-0.731;DP=12;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AD:GQ:PL:SB 0/0:4,1,0:8:0,8,91,12,93,97:4,0,0,0

Sample14

Sample13

Sample10 1 24089531 . A G, 7.60 . BaseQRankSum=-1.380;DP=9;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.380;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:35:35,0,91,47,97,144:4,0,2,0

Sample11 1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Sample12 1 24089531 . A G, 9.31 . BaseQRankSum=0.358;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.358;ReadPosRankSum=-0.358 GT:AD:GQ:PL:SB 0/1:3,2,0:38:38,0,49,47,55,102:0,0,0,0

Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.

GATK Version=3.2-2-gec30cee

Thanks in Advance.


Created 2014-12-12 12:53:50 | Updated | Tags: ad dp genotypegvcfs
Comments (7)

Hi,

I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

GC G, 522.73 . BaseQRankSum=-0.716;ClippingRankSum=0.219;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=66.41;MQ0=0;MQRankSum=1.681;ReadPosRankSum=-1.155 GT:AD:DP:GQ:PL:SB 0/1:20,18,0:38:99:560,0,633,620,687,1307:10,10,10,8

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:

0/1:20,0:38:99:560,0,633

i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot

Dan


Created 2014-07-22 21:35:03 | Updated | Tags: dp haplotype-caller
Comments (6)

I have run haplotype caller using gatk 3.2. I used Haplotype Caller on 4 samples, then GVCF on the 4 samples combined (commands below). I used select variants to randomly sample a portion of the SNPs for one of those 4 samples. When I look in IGV I notice a lot of strange results.

For example a clearly tri allelic site is called biallelic in the vcf (see attached) chr01 102445496 . A G 115.24 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.28;ClippingRankSum=0.124;DP=19;FS=0.000;GQ_MEAN=54.50;GQ_STDDEV=52.80;MQ=56.03;MQ0=0;MQRankSum=0.406;NCC=0;QD=4.61;ReadPosRankSum=0.406 GT:AD:DP:GQ:PL 0/1:8,7:19:99:220,0,248

Another error, which is common, is that the DP field is wrong - the counts it lists are incorrect (see attached) e.g. The depth here is listed as 5,23 but IGV shows that it is really 5,1 chr07 28268592 . C A 842.02 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.179e+00;ClippingRankSum=0.240;DP=28;FS=49.925;GQ_MEAN=44.50;GQ_STDDEV=62.93;MQ=60.00;MQ0=0;MQRankSum=0.300;NCC=2;QD=30.07;ReadPosRankSum=1.02 GT:AD:DP:GQ:PL 0/1:5,23:28:89:867,0,89

The base qualities of all the bases that are not N, are >30. Duplicate reads have been removed. I know other users had issues with DP when coverage is high, but these samples are ~15-25x.

SNP calling commands

java -Xmx8G -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} \ -T HaplotypeCaller \ -I ${in_dir}/"36E_"${REGION}""${SAMPLE_ABB}"_lowqNrecalibrated_3.bam" \ --min_base_quality_score 30 \ --output_mode EMIT_ALL_SITES \ --intervals $sites_dir/$file \ -o ${out_dir}/"37F"${REGION}"_"${SAMPLE_ABB}"_HapCal.vcf" \

java -Xmx4G -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T GenotypeGVCFs\ --variant ${out_dir}/"37F_"${REGION}""${SAMPLE1}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE2}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE3}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE4}"_HapCal_gVCF.vcf" \ -allSites \ --intervals $sites_dir/$file \ -o ${out_dir}/"38F"${REGION}"_"${FOUR}"_HapCal_includeNonVariantSites.vcf" \


Created 2014-02-02 18:34:10 | Updated | Tags: combinevariants dp
Comments (9)

Hi team,

I've been having some issues with DP following CombineVariants:

I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup.

When I merge the files using CombineVariants, I notice that the DP per each variant is actually equal to the sum of DP of each of the vcf files. For example: If for a shared variant in both vcf files the DP=8, then the DP in the union file will be DP=16. Neverhteless, if a given variant is not shared by both files, then the DP in the union file will be equal to the input file.

Is there a way resolve this issue?

Thanks!

Best,

Sagi


Created 2014-01-10 15:39:54 | Updated 2014-01-10 16:16:06 | Tags: ad dp
Comments (1)

Hi,

I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.

GT:AD:DP:GQ:PL 0/1:52,37:88:99:1078,0,1486

However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.

GT:AD:DP:GQ:PL 0/1:51,0:88:99:0,99,1283

I think I recall another post having a similar issue due to multithreaded use of select variants

http://gatkforums.broadinstitute.org/discussion/1943/weird-behaviour-of-selectvatiants

APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!

Thanks,

MC


Created 2013-11-15 21:38:02 | Updated | Tags: haplotypecaller dp
Comments (1)

I am using GATK 2.5, and am calling HaplotypeCaller as

java -Xmx4g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I sample.bam -o hap.vcf -L subset.bed -dt NONE -D dbsnp_137.b37_addChr.vcf -filterNoBases -A Coverage -A DepthPerAlleleBySample

In the output vcf file, in the FORMAT field, I have GT:AD:GQ:PL. DP is lost. Even though it is specified in the header. UnifiedGenotyper is working fine, and producing DP information. Any ideas on what is wrong?

Thanks


Created 2013-10-23 12:12:10 | Updated | Tags: dp sigma
Comments (7)

Hi,

I would like to filter variants based on max DP. I understand that in order to define the max DP cutoof there is a need to calculate the Sigma, as stated in GATK's best practices:

"The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts."

Hence, how can one calculate the sigma, and what is it exactly?

Thanks!

Sagi


Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp
Comments (27)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam


Created 2013-01-07 13:05:07 | Updated | Tags: variantfiltration dp
Comments (23)

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time


Created 2012-12-21 23:24:24 | Updated 2013-01-07 19:15:12 | Tags: dp
Comments (3)

Hello,

I am trying Unified Genotyper on a 454 data set combined with some artificial data. The bam file for this data has 12 aligned reads spanning the locus: Chr17:7578406. These 12 reads are obtained from real 454 run. On executing Unified Genotyper with the following options:

java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o NG.1x.CL316.vcf --dbsnp ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410"

I get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.529;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.529;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195

Then I added a copy of same 12 reads (after modifying the names of the added reads to give each added read a unique name) to the above bam file. I have ensured that the names are unique in the bam file. Then again I run UnifiedGenotyper and get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       165.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.966;DB;DP=24;Dels=0.00;FS=1.885;HaplotypeScore=5.5768;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.205;QD=6.91;ReadPosRankSum=3.367     GT:AD:DP:GQ:PL  0/1:14,10:22:99:194,0,363

I want to point out that in the first run, I see that INFO and FORMAT column's have DP=12 and AD of 7,5 add up indicating that no read for filtered for making the variant call. However in the second run, even though the 12 extra reads are exactly the same as the earlier 12 reads, we see that while the INFO column DP=24, the FORMAT column DP is 22. Which indicates that 2 reads have been filtered while making the variant call.

Also please note that I tried adding the 12 read incrementally. And observed that till adding the 7th read, the DPs were all matching up. But then when I added the 8th read, the result showed that 2 reads were filtered (from FORMAT column DP). So if only the 8th read was being filtered out, I would think it might be a problem with that read but why is an extra read which was not being filtered earlier is being filtered on addition of this read. Or is the FORMAT column DP incorrect?

The 12 aligned reads that were added to the sam file are attached. Please let me know if you would need any other info.

Thanks for any insight in this issue.


Created 2012-10-19 18:14:30 | Updated 2012-10-19 18:24:47 | Tags: reducereads ad dp
Comments (4)

Does the relationship between AD and DP stil hold in VCF produced from ReduceRead BAMs? That is the sum of AD is <= DP Or can other scenarios now occur?

Also is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.