# Tagged with #diagnosetargets 1 documentation article | 2 announcements | 13 forum discussions

Created 2012-07-23 16:52:26 | Updated 2015-07-06 13:49:14 | Tags: depthofcoverage diagnosetargets coverage

### Overview

This document describes the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

The tools involved are DepthOfCoverage and DiagnoseTargets, both described below.

For an overview of the major annotations that are used by variant callers to express read depth at a variant site, and guidelines for using those metrics to evaluate variants, please see this document.

### Introduction to coverage analysis as a QC method

Coverage analysis generally aims to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

This section is incomplete.

### Using DepthOfCoverage to QC whole-genome data

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) BAM file. It presents useful aggregate coverage data using a histogram whose granularity can be user-specified. It reports the following metrics over the entire BAM file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)
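To make these summary metrics concrete, here is a minimal sketch (function names are invented for illustration; this is not DepthOfCoverage's actual implementation) of how the per-interval statistics and one cell of the proportions matrix can be derived from raw per-locus depths:

```python
from statistics import mean, median, quantiles

def summarize_depths(depths):
    """Total, mean, median, and quartiles of per-locus depths for one interval."""
    q1, med, q3 = quantiles(depths, n=4)  # quartile cut points
    return {"total": sum(depths), "mean": mean(depths),
            "median": median(depths), "q1": q1, "q3": q3}

def fraction_covered_at(depths, min_depth):
    """Proportion of loci covered to at least min_depth
    (one cell of the 'proportion of bases covered to >= X depth' matrix)."""
    return sum(d >= min_depth for d in depths) / len(depths)
```

Applying `fraction_covered_at` per sample and then counting how many samples exceed a cutoff yields the "at least X depth in at least Y samples" matrices described above.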

That last matrix is key to answering the question posed above, so we recommend running this tool on all samples together.

Note that DepthOfCoverage can be configured to output these statistics aggregated over genes by providing it with a RefSeq gene list.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,

For users who have access to internal Broad resources, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you do not have access (if you don't know, you probably don't have it), you can generate your own as described here.

If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     >500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene); it functions by aggregating coverage from the interval level to the gene level, by assigning each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is specified.
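The interval-to-gene aggregation described above can be sketched as follows (a simplified illustration with hypothetical names, assuming 1-based closed coordinates; the real tool reads the gene spans from the RefSeq file):

```python
def overlaps(interval, span):
    """True if two (start, end) ranges overlap (1-based, closed ends)."""
    return interval[0] <= span[1] and span[0] <= interval[1]

def aggregate_by_gene(interval_totals, genes):
    """Sum per-interval coverage totals into per-gene totals by overlap.

    interval_totals: {(start, end): total_coverage}
    genes: {gene_name: (start, end)}
    """
    gene_totals = {name: 0 for name in genes}
    for iv, total in interval_totals.items():
        for name, span in genes.items():
            if overlaps(iv, span):
                gene_totals[name] += total
    return gene_totals
```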

### Using DiagnoseTargets to QC whole-exome data

DiagnoseTargets produces a pseudo-VCF file that provides a "CallableStatus" judgment for each position or range of positions in the input bam file. The possible judgments are as follows:

• PASS : The base satisfied the minimum depth required for calling and stayed below maxDepth, so it avoids EXCESSIVE_COVERAGE.

• COVERAGE_GAPS : Absolutely no coverage was observed at the locus, regardless of the filtering parameters.

• LOW_COVERAGE : There were fewer than the minimum required number of bases at the locus, after applying filters.

• EXCESSIVE_COVERAGE : More than -maxDepth reads at the locus, indicating some sort of mapping problem.

• POOR_QUALITY : More than --maxFractionOfReadsWithLowMAPQ of the reads at the locus had low mapping quality.

• BAD_MATE : The reads are not properly mated, suggesting mapping errors.
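The decision logic behind these judgments can be summarized per locus roughly as follows (a hedged sketch with illustrative threshold values, not the tool's actual code):

```python
def callable_status(raw_depth, filtered_depth, low_mapq_fraction, bad_mate_fraction,
                    min_depth=5, max_depth=500,
                    max_low_mapq_fraction=0.1, max_bad_mate_fraction=0.5):
    """Assign a CallableStatus-style judgment to one locus (illustrative thresholds)."""
    if raw_depth == 0:
        return "COVERAGE_GAPS"       # no coverage at all, before any filtering
    if filtered_depth < min_depth:
        return "LOW_COVERAGE"        # too few bases after filters
    if filtered_depth > max_depth:
        return "EXCESSIVE_COVERAGE"  # suspiciously deep: likely a mapping problem
    if low_mapq_fraction > max_low_mapq_fraction:
        return "POOR_QUALITY"        # too many poorly mapped reads
    if bad_mate_fraction > max_bad_mate_fraction:
        return "BAD_MATE"            # too many improperly mated reads
    return "PASS"
```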

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is smaller than previous ones. We are working hard on some new tools and frameworks that we hope to make available to everyone in our next release.

## Unified Genotyper

• Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

## Haplotype Caller

• Improved the indexing scheme for gVCF outputs using the reference calculation model.
• The reference calculation model now works with reduced reads.
• Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
• Fixed a bug where homozygous reference records that aren't GVCF blocks were being treated incorrectly.

## Variant Recalibrator

• Disable tranche plots in INDEL mode.
• Various VQSR optimizations in both runtime and accuracy. In particular:
  • For very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled.
  • Annotations are ordered by the difference in means between known and novel sites, instead of by their standard deviation.
  • Removed the training set quality score threshold.
  • The negative model now uses 2 Gaussians by default.
  • The numBad argument has been removed; the cutoffs are now chosen by the model itself by looking at the LOD scores.

• Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

## Diagnose Targets

• Added calculation for GC content.
• Added an option to filter the bases based on their quality scores.

## Combine Variants

• Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

## Select Variants

• Changed the behavior of the PL/AD fields when the tool encounters a record that has lost one or more alternate alleles: instead of stripping these fields out, they now get fixed.

## Miscellaneous

• SplitSamFile now produces an index with the BAM.
• Length metric updates to QualifyMissingIntervals.
• Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
• Picard jar updated to version 1.104.1628.
• Tribble jar updated to version 1.104.1628.
• Variant jar updated to version 1.104.1628.

Created 2013-03-12 18:00:30 | Updated | Tags: depthofcoverage diagnosetargets

We have decided to continue providing and supporting DepthOfCoverage and DiagnoseTargets for the foreseeable future. Going forward, we'll try to integrate them and develop their features to address the main needs of the community. To this end we welcome your continuing feedback, so please feel free to contribute comments and ideas in this thread.

To all who took the time to tell us what you find useful about DoC and DT (and what you wish it could do), a big thank you! This is always very useful to us because it helps us identify which features are most valuable to our users.

Created 2016-04-13 21:57:40 | Updated | Tags: diagnosetargets vcf gt

Hi,

I am running the following command:

${java} -jar ${GATK} \
    -T DiagnoseTargets \
    -R human_g1k_v37.fasta \
    -L truseq_interval.bed \
    -ip 100 \
    -I cleaned.bam \
    -o interval_stats.vcf \
    -writeFullFormat

It works great, except that the output .vcf does not contain the GT field, which causes trouble later when annotating the vcf file.

Any workaround maybe?

Thanks !

Created 2016-04-12 20:17:35 | Updated | Tags: diagnosetargets annotation

Hi GATK !

I am using DiagnoseTargets on exome data with the 1000 Genomes human v37 reference genome and the Illumina exome interval .bed file. It seems that DiagnoseTargets (unlike DepthOfCoverage) doesn't accept the -geneList argument.

1) What would be a possible alternative for adding the gene name to the output interval stats .vcf file (and optionally to the --missing intervals)? Would VariantAnnotator (with the --comp argument) work?

2) What annotation file should I use? The sortByRef.pl script (mentioned here) is no longer available; did it only (i) discard records outside Chr1-22/X/Y/M, and (ii) sort by chromosome? Does VariantAnnotator automatically adjust from zero-based half-open intervals (the UCSC standard) to one-based closed intervals, or should I modify the file during the previous steps as well?
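For reference, the shift between the two conventions is mechanical: a zero-based half-open BED record [start, end) corresponds to the one-based closed interval start+1..end. A minimal sketch (a hypothetical helper, not a statement about what VariantAnnotator does internally):

```python
def bed_to_gatk_interval(chrom, bed_start, bed_end):
    """Convert a zero-based, half-open BED record to a one-based,
    closed GATK-style interval string."""
    return f"{chrom}:{bed_start + 1}-{bed_end}"
```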

Created 2015-10-11 19:18:25 | Updated | Tags: diagnosetargets gccontent

Hello! I tried the GATK DiagnoseTargets tool to analyze my interval metrics and found that the INFO field GC sometimes differs from the actual GC content of the reference genome. There are more than 1000 intervals with GC=0.00. I supposed that this field is constant for an interval and is measured on the reference sequence, but it looks like DiagnoseTargets has different logic for it. Can you explain how DiagnoseTargets computes GC content?

I am using GATK version 3.1-1. The tool was started with default options:

java -jar GenomeAnalysisTK.jar -T DiagnoseTargets -R hg19.fa -I sample1.bam -I sample2.bam -I sample3.bam -I sample4.bam -I sample5.bam -I sample6.bam -I sample7.bam -L panel.bed -o output.vcf -missing missing.intervals

Example row from the output vcf where GC=0.00 even though the actual GC content is not zero (there are 61 G/C bases in this interval):

chr11 75112674 . C <DT> . NO_READS END=75112787;GC=0.00;IDP=0.00
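For comparison, the metric the poster expects, i.e. GC measured directly on the interval's reference sequence, would simply be (a sketch of that expectation, not of DiagnoseTargets' actual logic):

```python
def gc_content(seq):
    """Fraction of G/C bases in a reference subsequence (case-insensitive)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)
```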

Created 2015-05-11 11:27:27 | Updated 2015-05-11 12:08:54 | Tags: depthofcoverage diagnosetargets qualifymissingintervals

I am using GATK for clinical whole-exome sequencing. I often have to answer these questions when evaluating the quality of a sequencing run:

1. How well are my genes of interest covered?
2. Which exons are not well covered?
3. Which intervals are not well covered?

I've tried several tools that try to address these questions (i.e. bcbio-nextgen, chanjo). But now I have the feeling that the tools mentioned in the title can more or less answer my questions, except for the nice feature of chanjo that allows storing/querying statistics across samples.

My question about these tools: when should I use which? From the names and the descriptions of their command line arguments, I can't answer this clearly. I tend to try all three in this order: DepthOfCoverage -> QualifyMissingIntervals -> DiagnoseTargets.

So again, when to use which?

Thanks Vang

Created 2015-01-05 17:47:22 | Updated 2015-01-05 17:56:46 | Tags: diagnosetargets

Hi.

I tried running DiagnoseTargets on the same BAM file using two versions of the tool, and I see differences in the results. Can you please help me understand why this is happening? I see a lot of differences in the BAD_MATE filter. I have put two sample intervals below.

I used GATK 3.1-1-g07a4bf8 and GATK-lite v2.3.9-lite with default command line options.

## GATK-lite v2.3.9-lite output

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00109 HG00118 HG00134 HG00151 HG00152 HG00156 HG00176 HG00185 HG00237 HG00306 HG00335 HG00377 HG00378 HG00443 HG00452 HG00580 HG01075 NA18982 NA18992 NA19079 NA19116 NA19324 NA19346 NA19431 NA19456 NA19654 NA19664 NA19746 NA20525 NA20801

1 955543 . T

. LOW_MEDIAN_DEPTH AVG_INTERVAL_DP=408.03;END=955763 AVG_INTERVAL_DP:FT:MED:Q1:Q3 19.19:PASS:18.00:9.00:30.00 17.30:PASS:17.00:10.00:25.00 23.40:PASS:23.00:16.00:33.00 14.40:PASS:13.00:8.00:21.00 20.44:PASS:21.00:14.00:27.00 13.71:PASS:13.00:10.00:17.00 22.86:PASS:21.00:14.00:31.00 20.05:PASS:19.00:12.00:26.00 19.88:PASS:19.00:13.00:28.00 19.31:PASS:20.00:12.00:26.00 23.28:PASS:23.00:15.00:32.00 14.91:PASS:14.00:9.00:19.00 3.49:LOW_COVERAGE:4.00:3.00:4.00 7.13:LOW_COVERAGE:8.00:3.00:11.00 3.39:LOW_COVERAGE:4.00:2.00:4.00 8.02:PASS:8.00:5.00:10.00 6.35:LOW_COVERAGE:7.00:3.00:9.00 15.41:PASS:15.00:8.00:21.00 8.85:LOW_COVERAGE:9.00:4.00:13.00 14.95:PASS:15.00:9.00:21.00 16.41:PASS:17.00:7.00:25.00 10.30:LOW_COVERAGE:9.00:2.00:18.00 11.15:PASS:10.00:6.00:15.00 17.19:PASS:15.00:7.00:27.00 10.57:PASS:11.00:5.00:16.00 5.43:LOW_COVERAGE:5.00:2.00:8.00 7.21:BAD_MATE;LOW_COVERAGE:6.00:2.00:12.00 11.43:PASS:12.00:5.00:17.00 11.11:BAD_MATE:11.00:4.00:18.00 10.91:BAD_MATE:12.00:6.00:14.00

1 957571 . T

## GATK 3.1-1-g07a4bf8 output

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00109 HG00118 HG00134 HG00151 HG00152 HG00156 HG00176 HG00185 HG00237 HG00306 HG00335 HG00377 HG00378 HG00443 HG00452 HG00580 HG01075 NA18982 NA18992 NA19079 NA19116 NA19324 NA19346 NA19431 NA19456 NA19654 NA19664 NA19746 NA20525 NA20801

1 955543 . T

. PASS END=955763;GC=0.751;IDP=407.84 FT:IDP:LL:ZL PASS:19.15:0:0 PASS:17.30:0:0 PASS:23.40:0:0 PASS:14.40:17:0 PASS:20.27:0:0 PASS:13.71:0:0 PASS:22.87:0:0 PASS:20.05:0:0 PASS:19.88:0:0 PASS:19.31:0:0 PASS:23.28:0:0 PASS:14.91:0:0 LOW_COVERAGE;POOR_QUALITY:3.49:108:0 LOW_COVERAGE:7.13:65:0 LOW_COVERAGE;POOR_QUALITY:3.39:106:0 PASS:8.02:0:0 LOW_COVERAGE:6.35:57:0 PASS:15.41:0:0 LOW_COVERAGE:8.85:51:0 PASS:14.95:0:0 PASS:16.41:0:0 LOW_COVERAGE:10.30:56:10 PASS:11.15:6:0 PASS:17.19:0:0 PASS:10.57:20:0 LOW_COVERAGE:5.43:74:0 LOW_COVERAGE:7.21:91:0 PASS:11.43:39:0 PASS:11.11:26:0 PASS:10.91:7:0

1 957571 . T

. PASS END=957852;GC=0.610;IDP=578.29 FT:IDP:LL:ZL PASS:15.64:10:0 PASS:22.86:15:4 PASS:24.36:14:0 PASS:21.54:20:10 PASS:31.96:11:0 PASS:15.21:19:0 PASS:17.68:19:0 PASS:21.70:8:8 PASS:19.06:8:11 PASS:17.14:14:11 PASS:23.21:25:0 PASS:11.30:41:5 PASS:23.26:11:0 PASS:24.19:30:0 PASS:22.84:8:0 PASS:19.32:30:0 PASS:24.58:25:0 PASS:15.92:7:4 PASS:16.41:13:0 PASS:9.40:43:7 PASS:18.96:27:4 PASS:12.98:37:3 PASS:13.00:21:0 PASS:15.40:13:7 PASS:13.66:11:0 PASS:13.81:19:0 POOR_QUALITY:7.40:32:19 PASS:14.15:27:9 PASS:15.21:26:0 PASS:56.13:11:0

These are the command lines:

## GATK-lite v2.3.9-lite

Thanks Arun

Created 2014-10-17 15:30:47 | Updated | Tags: diagnosetargets read-depth

Hi again. In my DiagnoseTargets output I have a lot of NO_READS flags, but most of these targets had a SNP called by HC on the same set of bam files:

table(rowData(nextseq.dt.vcf)$FILTER)
                 2136                  1114                 14775
NO_READS;POOR_QUALITY                  PASS          POOR_QUALITY
                  307                 56599                   479

I have had a look at one site that is flagged as NO_READS, but there are reads:

line from unfiltered VCF from HC

chr1 109808776 . C G 106.25 . AC=1;AF=0.042;AN=24;BaseQRankSum=-1.059e+00;ClippingRankSum=-2.329e+00;DP=127;FS=0.000;GQ_MEAN=36.25;GQ_STDDEV=35.51;InbreedingCoeff=-0.0565;MLEAC=1;MLEAF=0.042;MQ=52.57;MQ0=0;MQRankSum=1.48;NCC=0;QD=6.64;ReadPosRankSum=-2.120e-01 GT:AD:DP:GQ:PL 0/0:4,0:4:12:0,12,87 0/0:12,0:12:36:0,36,297 0/0:7,0:7:21:0,21,169 0/0:21,0:21:60:0,60,514 0/0:8,0:8:24:0,24,172 0/0:6,0:6:6:0,6,90 0/0:11,0:11:24:0,24,254 0/0:11,0:11:27:0,27,288 0/0:11,0:11:27:0,27,255 0/1:9,7:16:99:141,0,172 0/0:11,0:11:30:0,30,321 0/0:9,0:9:27:0,27,228

line from DiagnoseTargets

chr1 109808776 . C <DT> . NO_READS END=109808776;GC=1.00;IDP=165.00 FT:IDP:LL:ZL NO_READS:6.00:0:0 NO_READS:20.00:0:0 NO_READS:11.00:0:0 NO_READS:22.00:0:0 NO_READS:10.00:0:0 NO_READS:11.00:0:0 NO_READS:14.00:0:0 PASS:13.00:0:0 NO_READS:9.00:0:0 PASS:15.00:0:0 PASS:19.00:0:0 NO_READS:15.00:0:0

Are there any other filters that these reads are not passing? I couldn't spot anything in the documentation, but I wouldn't put it past me! I ran a fairly standard incantation of DiagnoseTargets, though I did use --interval_merging OVERLAPPING_ONLY.

Any thoughts? Many thanks Anna

Created 2014-10-14 15:55:25 | Updated | Tags: diagnosetargets vcf

I have used a VCF file that was produced by GATK for the -L option of DiagnoseTargets, but I get the alternate allele from the original VCF as the reference allele in the output vcf from DiagnoseTargets. Input VCF:

chr1 10425470 . G A 139.99 SNPBHF AC=1;AF=0.042;AN=24;BaseQRankSum=-5.862e+00;Clipp......

chr1 10425470 . A <DT> . PASS END=10425470;GC=1.00;IDP=805.00 FT:IDP:LL:ZL......

The GC content reflects the reference allele though. Is this normal behaviour or a bug?

Created 2014-07-23 10:32:53 | Updated | Tags: diagnosetargets

When running DiagnoseTargets (on a single sample) I get some results that I have trouble understanding.

As far as I know (using default values):

LOW_COVERAGE means >20% of the bases have a coverage below 5
COVERAGE_GAPS means >20% of the bases have a coverage of 0
NO_READS means there are no reads contained in the interval
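One way to read those default definitions as interval-level logic (a hypothetical sketch of the poster's stated understanding, not the tool's actual code):

```python
def interval_status(depths, min_cov=5, threshold=0.20):
    """Classify an interval from per-locus depths, per the definitions above."""
    n = len(depths)
    if n == 0 or all(d == 0 for d in depths):
        return "NO_READS"       # no reads in the interval at all
    if sum(d == 0 for d in depths) / n > threshold:
        return "COVERAGE_GAPS"  # >20% of bases with zero coverage
    if sum(d < min_cov for d in depths) / n > threshold:
        return "LOW_COVERAGE"   # >20% of bases below the minimum coverage
    return "PASS"
```

Note this is only one possible reading, in which COVERAGE_GAPS takes precedence over LOW_COVERAGE; the combined statuses in the output rows below suggest the real filters are applied independently.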

Question 1: What is the difference between these two, given that COVERAGE_GAPS always implies LOW_COVERAGE? Or are bases with zero coverage filtered out first, so that in the second case >20% of the remaining bases have coverage below 5?

chr1    33065688        .       T       <DT>    .       COVERAGE_GAPS   END=33071551;GC=0.035;IDP=20.58 FT:IDP:LL:ZL    COVERAGE_GAPS:20.58:102:5419
chr1    33116749        .       G       <DT>    .       COVERAGE_GAPS;LOW_COVERAGE      END=33116923;GC=0.331;IDP=0.886 FT:IDP:LL:ZL    COVERAGE_GAPS;LOW_COVERAGE:0.886:78:97

Question 2: How can I explain these entries?

chr1    33116959        .       C       <DT>    .       LOW_COVERAGE;NO_READS   END=33117033;GC=0.693;IDP=3.68 FT:IDP:LL:ZL    LOW_COVERAGE;NO_READS:3.68:75:0
chr2    242274737       .       A       <DT>    .       LOW_COVERAGE;NO_READS   END=242274766;GC=0.133;IDP=1.00 FT:IDP:LL:ZL    LOW_COVERAGE;NO_READS:1.00:10:20

Created 2013-10-24 15:25:47 | Updated | Tags: diagnosetargets

Hello,

1. Is DiagnoseTargets counting only reads that have mapped uniquely? Is that one of the default filters?

2. From the vcf I see that:
IDP - Average depth across the interval: sum of the depth at each locus divided by the interval size.
LL - Number of loci for this sample, in this interval, with low coverage (below the minimum coverage) but not zero.
ZL - Number of loci for this sample, in this interval, with zero coverage.

I'm interested in the total number of reads mapped to each interval.

Is it true that IDP = #reads_in_this_interval/(LL+ZL)? So if I want to extract #reads_in_this_interval, can I look at IDP*(LL+ZL)? I have a different number of reads in each sample, so I first need to normalize it.
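Per the field definitions quoted above, IDP is the summed per-locus depth divided by the interval size, so multiplying it back out recovers base coverage rather than a read count (each read contributes depth at many loci). A small sketch of those definitions (illustrative only, with an invented minimum-coverage default):

```python
def interval_stats(depths, min_coverage=5):
    """IDP, LL, ZL for one sample/interval, from per-locus depths."""
    size = len(depths)
    idp = sum(depths) / size                        # average depth across the interval
    ll = sum(0 < d < min_coverage for d in depths)  # low-but-nonzero loci
    zl = sum(d == 0 for d in depths)                # zero-coverage loci
    return idp, ll, zl

# idp * size gives back the summed base coverage, not the number of reads;
# a read count would need e.g. samtools view -c on the interval.
```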

Thanks! Moran.

Created 2013-10-16 20:29:40 | Updated | Tags: diagnosetargets

Hi again,

is there an automatic way to generate an interval list for every 1kb in the reference sequence? I want to run DiagnoseTargets for every 1kb in the data.
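One simple approach is to generate the windows yourself. A sketch (a hypothetical helper; chromosome sizes would come from the reference's .fai index) that emits GATK-style one-based, closed intervals:

```python
def make_windows(chrom_sizes, width=1000):
    """Yield one-based, closed intervals ("chr:start-end") tiling each chromosome."""
    for chrom, size in chrom_sizes.items():
        start = 1
        while start <= size:
            end = min(start + width - 1, size)  # last window may be shorter
            yield f"{chrom}:{start}-{end}"
            start = end + 1
```

Tools such as bedtools makewindows can produce equivalent windows in BED format, which GATK also accepts as -L input.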

thanks!!

Created 2013-10-16 19:10:38 | Updated | Tags: diagnosetargets

Hello,

Is there a way to run DiagnoseTargets with a list of bam files, instead of multiple "-I bam_file" tags?
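GATK 3.x walkers generally accept a file of paths with a .list extension wherever -I takes a BAM, so one workaround (a sketch; the glob pattern and file names are placeholders) is to write the BAM paths to such a file:

```python
from pathlib import Path

# Collect BAM paths into a .list file; GATK 3.x accepts a file of paths
# (one per line, .list extension) in place of repeated -I arguments.
bams = sorted(str(p) for p in Path(".").glob("*.bam"))
Path("bams.list").write_text("\n".join(bams) + "\n")
# Then: java -jar GenomeAnalysisTK.jar -T DiagnoseTargets -I bams.list ...
```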

thanks!

Created 2013-08-07 11:45:34 | Updated | Tags: diagnosetargets

I am running DiagnoseTargets on a list of intervals corresponding to targeted exons. In the result file, intervals are filtered as e.g. PASS, LOW_COVERAGE, COVERAGE_GAPS or NO_READS for each sample, as well as for the sample set as a whole. In the individual-sample FORMAT fields, information on FT (filter) and IDP (average sample depth across the interval) is given. How come I often see a combination like this?

Filtered as NO_READS, yet an average depth of 96.09 across the interval? Of course, PART of the interval may be without reads, even more than the threshold set by --coverage_status_threshold, but as I understand it that is what COVERAGE_GAPS means. I also wondered whether the reads might all have been filtered out due to low quality parameters, so I adjusted --minimum_base_quality and --minimum_mapping_quality to 0, but NO_READS is still flagged for a number of intervals, despite IDP being far from 0. Is this a bug, or have I misunderstood something?

L. Pihlstrom

Created 2013-03-05 15:52:53 | Updated 2013-03-05 15:56:27 | Tags: depthofcoverage diagnosetargets

Now that DepthOfCoverage is being retired in GATK 2.4, I decided to investigate DiagnoseTargets. I have some questions.

1) Are there plans to support output formats other than VCF? What was great about DOC is I could easily send the output to anyone and it could be easily read. With DT, that requires additional processing.

2) DOC provided summary for intervals as well as samples. DT only does intervals. Is there a way to get per-sample info?

3) DOC output full intervals. DT only outputs the start positions. To get the end positions, an additional step is required. Can that be adjusted?

4) DOC provided coverage info for all intervals. DT only shows covered intervals, so if an interval is not covered, it will not be listed in the output. Is there a way to output all intervals?

Sorry if I sound too critical. I am a big fan of DepthOfCoverage and am disappointed to see it go.

Thank you.