Tagged with #dp


Comments (9)

Hi team,

I've been having some issues with DP following CombineVariants:

I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup.

When I merge the files using CombineVariants, I notice that the DP of each variant is equal to the sum of the DP values from the two vcf files. For example, if a variant shared by both vcf files has DP=8 in each, then the DP in the union file will be DP=16. Nevertheless, if a given variant is not shared by both files, then the DP in the union file will be equal to the DP in the input file.

Is there a way to resolve this issue?

Thanks!

Best,

Sagi
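
The arithmetic described above can be written down as a small sketch. This only models the behaviour reported in the post; it is not CombineVariants' actual implementation. Both function names are hypothetical, and `max_input_dp` is just one possible alternative to summing, which would only make sense under the assumption that both callers were run on the same reads.

```python
def observed_union_dp(dp_a=None, dp_b=None):
    """Model the DP seen in the union file, as described in the post.

    Pass None for an input file that does not contain the variant.
    """
    present = [dp for dp in (dp_a, dp_b) if dp is not None]
    if not present:
        raise ValueError("variant must be present in at least one input")
    # Shared variants get the sum; unshared variants keep the single input DP.
    return sum(present)


def max_input_dp(dp_a=None, dp_b=None):
    """Hypothetical alternative: keep the larger input DP instead of the sum."""
    present = [dp for dp in (dp_a, dp_b) if dp is not None]
    if not present:
        raise ValueError("variant must be present in at least one input")
    return max(present)


print(observed_union_dp(8, 8))  # variant shared by both files
print(observed_union_dp(8))     # variant present in one file only
print(max_input_dp(8, 8))       # alternative that avoids double counting
```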

Comments (1)

Hi,

I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.

GT:AD:DP:GQ:PL 0/1:52,37:88:99:1078,0,1486

However, after running SelectVariants to exclude non-variant sites and variants not passing the filter, the AD changed and the alternate-allele reads were eliminated, though the DP remained unchanged.

GT:AD:DP:GQ:PL 0/1:51,0:88:99:0,99,1283

I think I recall another post describing a similar issue caused by multithreaded use of SelectVariants:

http://gatkforums.broadinstitute.org/discussion/1943/weird-behaviour-of-selectvatiants

Apologies for not commenting on that post instead; I had already posted this before seeing it.

Thanks,

MC

Comments (1)

I am using GATK 2.5, and am calling HaplotypeCaller as

java -Xmx4g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I sample.bam -o hap.vcf -L subset.bed -dt NONE -D dbsnp_137.b37_addChr.vcf -filterNoBases -A Coverage -A DepthPerAlleleBySample

In the output vcf file, the FORMAT field contains GT:AD:GQ:PL. DP is missing, even though it is specified in the header. UnifiedGenotyper works fine and produces DP information. Any ideas on what is wrong?

Thanks

Comments (7)

Hi,

I would like to filter variants based on max DP. I understand that in order to define the max DP cutoff, sigma needs to be calculated, as stated in GATK's best practices:

"The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts."

Hence, how can one calculate the sigma, and what is it exactly?

Thanks!

Sagi
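
One reading of the quoted guidance is that sigma is the standard deviation of depth around the mean coverage, so the cutoff X is the mean plus 5 or 6 standard deviations. Here is a minimal sketch under that assumption. The function name `max_dp_cutoff` and the depth values are made up for illustration, and how the depths are collected (for example from the DP values of a VCF) is left open.

```python
import statistics


def max_dp_cutoff(depths, n_sigma=5.0):
    """Return mean + n_sigma * standard deviation of the given depths."""
    mean = statistics.mean(depths)
    sigma = statistics.pstdev(depths)  # population standard deviation
    return mean + n_sigma * sigma


# Made-up depths for illustration only.
depths = [28, 30, 32, 30, 30]
cutoff = max_dp_cutoff(depths, n_sigma=5.0)
print(f"flag sites with DP > {cutoff:.2f}")
```

Sites whose DP exceeds the returned value would then be the candidates for the "DP > X" filter mentioned in the quote.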

Comments (7)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam

Comments (22)

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file, marking variants with low depth. However, the DP entry in the INFO field is across all samples, and even if it were possible to assess each individual's DP, I would still have to decide how to handle a variant that has low depth in one sample and high depth in another. Any suggestions are appreciated.

Thanks for your time
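
To make the per-sample question concrete, here is a rough sketch that reads the DP from each sample's FORMAT column of a single record and reports which samples fall below a threshold. It is plain Python rather than a GATK tool; the function name `low_depth_samples`, the threshold of 10, and the two-sample record are all made up for illustration. What to do with a site that is low in one sample and high in another is still a policy choice that this does not answer.

```python
def low_depth_samples(vcf_line, sample_names, min_dp=10):
    """Return names of samples whose FORMAT DP is missing or below min_dp."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        # No per-sample DP at this site: treat every sample as unknown/low.
        return list(sample_names)
    dp_index = format_keys.index("DP")
    flagged = []
    for name, sample in zip(sample_names, fields[9:]):
        values = sample.split(":")
        # Trailing FORMAT values may be dropped, e.g. for "./." genotypes.
        dp = values[dp_index] if dp_index < len(values) else "."
        if dp == "." or int(dp) < min_dp:
            flagged.append(name)
    return flagged


# Made-up two-sample record: INFO DP=40 hides that s1 only has 5 reads.
line = "\t".join(["1", "100", ".", "A", "G", "50", ".", "DP=40",
                  "GT:AD:DP", "0/1:3,2:5", "0/1:20,15:35"])
print(low_depth_samples(line, ["s1", "s2"]))
```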

Comments (3)

Hello,

I am trying Unified Genotyper on a 454 data set combined with some artificial data. The bam file for this data has 12 aligned reads spanning the locus Chr17:7578406. These 12 reads were obtained from a real 454 run. On executing Unified Genotyper with the following options:

java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o NG.1x.CL316.vcf --dbsnp ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410"

I get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.529;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.529;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195

Then I added a copy of the same 12 reads to the above bam file, after renaming the added reads so that every read name in the bam file is unique. I then ran UnifiedGenotyper again and got the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       165.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.966;DB;DP=24;Dels=0.00;FS=1.885;HaplotypeScore=5.5768;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.205;QD=6.91;ReadPosRankSum=3.367     GT:AD:DP:GQ:PL  0/1:14,10:22:99:194,0,363

I want to point out that in the first run, the INFO and FORMAT columns both have DP=12, and the AD values of 7,5 add up to 12, indicating that no read was filtered when making the variant call. However, in the second run, even though the 12 extra reads are exactly the same as the earlier 12 reads, the INFO column has DP=24 while the FORMAT column has DP=22, which indicates that 2 reads were filtered when making the variant call.

Also, please note that I tried adding the 12 reads incrementally and observed that up to the 7th added read, the DPs all matched. When I added the 8th read, however, the FORMAT column DP showed that 2 reads were filtered. If only the 8th read were being filtered out, I would think the problem might be with that read, but why is an additional read that was not filtered earlier now being filtered once this read is added? Or is the FORMAT column DP incorrect?

The 12 aligned reads that were added to the sam file are attached. Please let me know if you would need any other info.

Thanks for any insight into this issue.
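
For anyone wanting to check the same three numbers on their own records, here is a small sketch that pulls the INFO DP, the FORMAT DP, and the sum of AD out of a single-sample record. It is plain Python, not part of GATK, and the function name `depth_summary` is hypothetical. The record is the second result from this post; it is split on whitespace because the pasted text uses spaces, whereas a real VCF file is tab-delimited.

```python
def depth_summary(record):
    """Extract INFO DP, FORMAT DP and sum(AD) from a single-sample VCF record."""
    fields = record.split()  # pasted text uses spaces; real VCF uses tabs
    # INFO flags without "=" (such as DB) are skipped.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "info_dp": int(info["DP"]),
        "format_dp": int(sample["DP"]),
        "ad_sum": sum(int(x) for x in sample["AD"].split(",")),
    }


record = (
    "17 7578406 rs28934578 C T 165.77 . "
    "AC=1;AF=0.500;AN=2;BaseQRankSum=-0.966;DB;DP=24;Dels=0.00;FS=1.885;"
    "HaplotypeScore=5.5768;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;"
    "MQRankSum=0.205;QD=6.91;ReadPosRankSum=3.367 "
    "GT:AD:DP:GQ:PL 0/1:14,10:22:99:194,0,363"
)
print(depth_summary(record))
```

For this record the INFO DP and the AD sum are both 24 while the FORMAT DP is 22, which is the mismatch described above.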

Comments (4)

Does the relationship between AD and DP still hold in VCFs produced from ReduceReads BAMs? That is, is the sum of AD <= DP, or can other scenarios now occur?

Also, is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.
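
A quick way to test the sum(AD) <= DP relationship on individual genotypes is sketched below. The function name `ad_dp_relation` is hypothetical and nothing here is specific to ReduceReads; it only compares the two FORMAT values. The two genotype strings are taken from records quoted elsewhere on this page.

```python
def ad_dp_relation(format_str, sample_str):
    """Return (sum of AD, DP, whether sum(AD) <= DP) for one genotype."""
    values = dict(zip(format_str.split(":"), sample_str.split(":")))
    ad_sum = sum(int(x) for x in values["AD"].split(","))
    dp = int(values["DP"])
    return ad_sum, dp, ad_sum <= dp


fmt = "GT:AD:DP:GQ:PL"
print(ad_dp_relation(fmt, "1/1:56,6300:6378:99:140761,8716,0"))  # relation holds
print(ad_dp_relation(fmt, "0/1:14,10:22:99:194,0,363"))          # sum(AD) exceeds DP
```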