This document describes the tools and concepts involved in sequence coverage analysis, whose purpose is to answer a common question: "(Where) do I have enough sequence data to be empowered to discover variants with reasonable confidence?"
The tools involved are the following:
DepthOfCoverage: for QC'ing coverage in whole-genome data (WGS)
DiagnoseTargets: for QC'ing coverage in exome data (WEx)
For an overview of the major annotations that are used by variant callers to express read depth at a variant site, and guidelines for using those metrics to evaluate variants, please see this document.
Coverage analysis generally aims to answer the common question: "(Where) do I have enough sequence data to be empowered to discover variants with reasonable confidence?"
This section is incomplete.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
That last matrix is key to answering the question posed above, so we recommend running this tool on all samples together.
Note that DepthOfCoverage can be configured to output these statistics aggregated over genes by providing it with a RefSeq gene list.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,
587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0,
587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0,
587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0,
589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0,
589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0,
589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
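As an illustration of the gene list layout, here is a minimal Python sketch that parses one record. It assumes the 16 whitespace-separated columns follow the UCSC refGene table layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat, exonFrames), which the example rows appear to match. The names `RefSeqRecord` and `parse_refseq_line` are hypothetical and not part of GATK.

```python
from typing import List, NamedTuple


class RefSeqRecord(NamedTuple):
    transcript: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    exon_starts: List[int]
    exon_ends: List[int]
    gene: str


def parse_refseq_line(line: str) -> RefSeqRecord:
    """Parse one record, assuming the UCSC refGene column order."""
    fields = line.split()
    if len(fields) != 16:
        raise ValueError(f"expected 16 fields, got {len(fields)}")
    # Exon coordinates are comma-terminated lists such as "58953,".
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]
    return RefSeqRecord(
        transcript=fields[1],
        chrom=fields[2],
        strand=fields[3],
        tx_start=int(fields[4]),
        tx_end=int(fields[5]),
        exon_starts=exon_starts,
        exon_ends=exon_ends,
        gene=fields[12],
    )
```

Running it on the first example row yields the transcript NM_001005484 on chr1 with a single exon and the gene name OR4F5. A line with a different number of columns raises an error rather than being silently misread.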
For users who have access to internal Broad resources, the properly-formatted file containing refseq genes and transcripts is located at
If you do not have access (if you don't know, you probably don't have it), you can generate your own as described here.
If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:
Gene_Name  Total_Cvg  Avg_Cvg  Sample_1_Total_Cvg  Sample_1_Avg_Cvg  Sample_1_Cvg_Q3  Sample_1_Cvg_Median  Sample_1_Cvg_Q1
SORT1      594710     238.27   594710              238.27            165              245                  330
NOTCH2     3011542    357.84   3011542             357.84            222              399                  >500
LMNA       563183     186.73   563183              186.73            116              187                  262
NOS1AP     513031     203.50   513031              203.50            91               191                  290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene). It works by aggregating coverage from the interval level to the gene level, assigning each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is set.
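The interval-to-gene aggregation described above can be sketched in Python as follows. This is an illustration only, not GATK's implementation: the function name `aggregate_by_gene`, the tuple layouts, and the simple "any overlap" assignment rule are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]    # (chrom, start, end), 1-based inclusive
Gene = Tuple[str, str, int, int]   # (name, chrom, start, end)


def aggregate_by_gene(
    interval_cvg: Iterable[Tuple[Interval, int]],
    genes: List[Gene],
) -> Dict[str, Tuple[int, float]]:
    """Return {gene: (total_coverage, average_coverage_per_base)}.

    interval_cvg pairs each interval with its total coverage (sum of
    per-base depths). Intervals that overlap no gene are dropped.
    """
    total = defaultdict(int)
    bases = defaultdict(int)
    for (chrom, start, end), cvg in interval_cvg:
        for name, g_chrom, g_start, g_end in genes:
            # Assumed rule: an interval counts toward every gene it overlaps.
            if chrom == g_chrom and start <= g_end and end >= g_start:
                total[name] += cvg
                bases[name] += end - start + 1
    return {name: (total[name], total[name] / bases[name]) for name in total}
```

Because the aggregation starts from intervals, it has nothing to work with when per-interval output is disabled, which is consistent with -geneList being ignored in that case.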
DiagnoseTargets produces a pseudo-VCF file that provides a "CallableStatus" judgment for each position or range of positions in the input bam file. The possible judgments are as follows:
PASS: The base satisfied the minimum depth for calling and had fewer than maxDepth reads, so it is not flagged as EXCESSIVE_COVERAGE.
COVERAGE_GAPS: Absolutely no coverage was observed at the locus, regardless of the filtering parameters.
LOW_COVERAGE: Fewer than the minimum depth of bases were observed at the locus after applying filters.
EXCESSIVE_COVERAGE: More than -maxDepth reads were observed at the locus, indicating some sort of mapping problem.
POOR_QUALITY: More than --maxFractionOfReadsWithLowMAPQ of the reads at the locus have low mapping quality, indicating that the reads are poorly mapped.
BAD_MATE: The reads are not properly mated, suggesting mapping errors.
NO_READS: There are no reads contained in the interval.
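To make the per-locus judgments concrete, here is a minimal Python sketch of the depth-based subset of the rules. It is not the tool's actual logic: the function name `callable_status`, the default thresholds, and the order in which the checks are applied are assumptions chosen for illustration, and BAD_MATE and NO_READS are left out because they depend on read pairing and interval-level information.

```python
from typing import Optional


def callable_status(
    raw_depth: int,
    filtered_depth: int,
    low_mapq_reads: int,
    min_depth: int = 4,                   # illustrative threshold
    max_depth: Optional[int] = None,      # None means no upper limit
    max_low_mapq_fraction: float = 0.1,   # illustrative threshold
) -> str:
    """Classify one locus from its read counts (simplified sketch)."""
    if raw_depth == 0:
        # No reads at all, regardless of filtering.
        return "COVERAGE_GAPS"
    if low_mapq_reads / raw_depth > max_low_mapq_fraction:
        return "POOR_QUALITY"
    if filtered_depth < min_depth:
        return "LOW_COVERAGE"
    if max_depth is not None and filtered_depth > max_depth:
        return "EXCESSIVE_COVERAGE"
    return "PASS"
```

For example, a locus with 10 raw reads of which only 2 survive filtering is LOW_COVERAGE under these thresholds, while a locus with 10 well-mapped reads is PASS.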
I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.
I ran the whole variant detection and hard-filtering process (GATK Best Practices) with both UnifiedGenotyper and HaplotypeCaller.
My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?
In particular, consider pairwise differences across individuals: the absolute values are lower for low coverage individuals than for high coverage individuals, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals, by subtracting from the total reference length all the filtered sites and all sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (if both pairs consist of members of similar populations).
However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.
Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would accept more false positives if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity), which is the average cross-individual diversity and thus already at the higher end for within-individual heterozygosity. I would expect the problem to be even worse if I chose lower values.
Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?
Many thanks in advance, Hannes
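The "accessible genome size" correction described in this post can be sketched in Python as follows. This is only an illustration of the arithmetic: the function name `pairwise_difference_rate`, the input layout, and the choice to count a difference whenever the two genotype strings are unequal are assumptions, not the poster's actual code.

```python
from typing import Iterable, Tuple

MISSING = "./."


def pairwise_difference_rate(
    sites: Iterable[Tuple[bool, str, str]],
    reference_length: int,
) -> float:
    """Differences per accessible site for one pair of individuals.

    sites yields (is_filtered, genotype_a, genotype_b) for every site in
    the VCF. Positions absent from the VCF are assumed to be callable
    and identical in both individuals.
    """
    inaccessible = 0
    differences = 0
    for is_filtered, gt_a, gt_b in sites:
        if is_filtered or gt_a == MISSING or gt_b == MISSING:
            # Filtered or uncalled sites are removed from the denominator.
            inaccessible += 1
        elif gt_a != gt_b:
            differences += 1
    accessible = reference_length - inaccessible
    if accessible <= 0:
        raise ValueError("no accessible sites for this pair")
    return differences / accessible
```

The correction only changes the denominator. If low-confidence genotypes are emitted as homozygous reference instead of ./., they stay in the accessible set and lower the numerator, which is the pattern the post attributes to HaplotypeCaller.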
Hi, I'm calling variants with HaplotypeCaller in a population of 2 parents and 7 F1 individuals. After read-backed phasing I'm combining the VCF files of my genotypes with CombineVariants. In the output file I very often find "./.". I thought this meant there is no coverage at a certain position, but at many of these positions I do have good coverage. Why do I then get ./.? Moreover, I used FastaAlternateReferenceMaker to create a new reference sequence that includes the variants from the parents. In that case, after I run HC and do the phasing and CombineVariants steps, I only get "./." at positions where there is really no coverage (as I can see in my mappings). Nadia
I have several samples, some with a coverage of around 14 and some with a coverage of around 6. I want to use UnifiedGenotyper for SNP calling, but I have no clue how to set stand_call_conf (and stand_emit_conf), since the suggestion is to set stand_call_conf to 30 for samples with coverage >10 and to Q4 for samples with coverage <10. So how should I proceed?
Calling SNPs using a single bam file with the command:
java -Xmx30g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R ref.fasta \
  -I input.bam \
  -o output.vcf
When looking at the output file, most DP values were equal to the sum of the AD values, and in a few cases the AD total was higher. I thought that the AD values are the unfiltered counts of all reads and that the DP field describes the total depth of reads that passed UnifiedGenotyper's internal quality control. Is it normal for the AD values to be higher than the DP value?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT eo227
gi|218430358|emb|CU928163.2| 2317180 . T G 76.55 . AC=1;AF=0.50;AN=2;BaseQRankSum=1.568;DP=10;Dels=0.00;FS=11.181;HRun=0;HaplotypeScore=15.8585;MQ=28.63;MQ0=0;MQRankSum=1.036;QD=7.66;ReadPosRankSum=-0.633;SB=-0.01 GT:AD:DP:GQ:PL 0/1:6,4:10:99:107,0,154
gi|218430358|emb|CU928163.2| 2317181 . T G 71.96 . AC=1;AF=0.50;AN=2;BaseQRankSum=0.550;DP=10;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=19.8574;MQ=28.63;MQ0=0;MQRankSum=-1.754;QD=7.20;ReadPosRankSum=-1.754;SB=-0.01 GT:AD:DP:GQ:PL 0/1:3,4:10:87.90:102,0,88
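A quick way to compare AD and DP across a whole file is to sum the AD values per genotype and set the total next to DP. The Python sketch below does this for a single sample column. The function name `ad_dp_summary` is hypothetical, and the code assumes both AD and DP are present and numeric (it does not handle missing values such as ".").

```python
from typing import Tuple


def ad_dp_summary(format_field: str, sample_field: str) -> Tuple[int, int]:
    """Return (sum of AD values, DP) for one sample column of a VCF record."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    ad_total = sum(int(x) for x in values["AD"].split(","))
    dp = int(values["DP"])
    return ad_total, dp


# The two records shown above:
print(ad_dp_summary("GT:AD:DP:GQ:PL", "0/1:6,4:10:99:107,0,154"))   # (10, 10)
print(ad_dp_summary("GT:AD:DP:GQ:PL", "0/1:3,4:10:87.90:102,0,88"))  # (7, 10)
```

In these two records the AD total equals DP at the first site and is lower than DP at the second, so neither of them shows the AD-greater-than-DP case asked about.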
I am also trying to check the coverage at each position of my reference using the CoverageBySample tool (with and without the -L argument):
java -Xmx30g -jar GenomeAnalysisTK.jar \
  -T CoverageBySample \
  -R ref.fasta \
  -I input.bam \
  -o output.cov
The output (below) gives the right coverage, but without the positions on the reference, and it also skips all positions with no coverage. Is there any way to get these positions in the output file?
eo78 10
eo78 10
eo78 10
eo78 10
eo78 10
eo78 11
eo78 12
eo78 12
eo78 12
Dear GATK team,
I have a question about adding functionality to CallableLoci to allow multiple coverage cutoffs for LOW_COVERAGE (similar to the -ct option in DepthOfCoverage), for example COVERAGE_BELOW_10X, COVERAGE_BELOW_20X, etc.
These multiple statistics, rather than a single LOW_COVERAGE value, are important for WGS interpretation. At this point a separate DepthOfCoverage instance has to be run (doing the same job twice), which takes much additional time compared with a single pass of CallableLoci.
A simple patch to the CallableLoci code does the job, but it would be great if this can be implemented in the build as a simple command line option.
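The idea of reporting several cutoffs in one pass can be sketched in Python as follows. This is not the patch mentioned above and not CallableLoci code: the function name `count_below_cutoffs` and the input (an iterable of per-locus depths) are assumptions for illustration.

```python
from typing import Dict, Iterable, Sequence


def count_below_cutoffs(
    depths: Iterable[int], cutoffs: Sequence[int]
) -> Dict[int, int]:
    """Count loci with depth below each cutoff in a single pass."""
    counts = {cutoff: 0 for cutoff in cutoffs}
    for depth in depths:
        for cutoff in cutoffs:
            if depth < cutoff:
                counts[cutoff] += 1
    return counts


# Five loci, cutoffs at 10X and 20X.
print(count_below_cutoffs([0, 5, 12, 25, 9], [10, 20]))  # {10: 3, 20: 4}
```

Each locus is visited once and compared against every cutoff, so adding more cutoffs does not require reading the data again.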
Hi GATK Team
You are doing an amazing job, keep it up!
I apologise in advance if this question has come up and I've not found it within the forum, but I am quite new to all of this and would like to ask you a few questions regarding identifying structural variation from exome resequencing data:
I am trying to assess the best method to identify potential structural variants from a single bam file: One way of doing this proposed to me was to look at DP values (using UnifiedGenotyper) that are less than 5 and understandably there are inherent confounders in doing so. So I ran the same bam file through the DepthOfCoverage tool to focus on regions of interest which have zero coverage. However, when I overlaid the data from both and mapped their co-ordinates to the human genome, I have found that the overlap between the DP values and DoC regions was extremely small (<5%) - why could this be? Surely there should be more overlap? Are they therefore measuring different things? Have I done something wrong somewhere and I don't know it? I have tried to access the documentation for DepthOfCoverage to try and make sense of it but it seems unavailable on the website (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). Please could you advise?
Below are the command lines I've been using:
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -omitBaseOutput -omitLocusTable -R referencefilename.fa -I samplefilename.bam -L regionsofinterest.txt -o outputfile.coverage
java -jar GenomeAnalysisTK.jar -R referencefilename.fa -T UnifiedGenotyper -I samplefilename.bam --dbsnp dbsnpreferencefile.vcf --genotype_likelihoods_model SNP -o outputfilename.vcf --output_mode EMIT_ALL_SITES -stand_call_conf 50.0 -stand_emit_conf 0.0 -dcov 200 -L regionsofinterest.bed
Thank you in advance for your help, it is much appreciated