This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.
The metrics involved are the following:
For an overview of the tools and concepts involved in sequence coverage analysis, which addresses the common question "(Where) do I have enough sequence data to discover variants with reasonable confidence?", please see this document.
The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:
At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.
The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.
Note that because the AD includes reads and bases that were filtered by the caller (and, in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
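To make the distinction between the two sample-level metrics concrete, here is a minimal Python sketch that pulls AD and DP out of one sample column of a VCF record and compares them. The function names (parse_sample_depths, allele_fractions) and the example record are hypothetical and made up for illustration; they are not part of any GATK tool. The interpretation of the difference follows the description above (AD from unfiltered reads, DP from filtered reads) and may not hold for every caller or version.

```python
def parse_sample_depths(format_field, sample_field):
    """Return (ad, dp) for one sample column of a VCF record.

    ad is a list of per-allele read counts (REF first, then each ALT),
    dp is the sample-level depth. Either is None when missing (".").
    """
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))

    ad_raw = fields.get("AD", ".")
    if ad_raw == "." or "." in ad_raw.split(","):
        ad = None
    else:
        ad = [int(count) for count in ad_raw.split(",")]

    dp_raw = fields.get("DP", ".")
    dp = None if dp_raw == "." else int(dp_raw)
    return ad, dp


def allele_fractions(ad):
    """Fraction of AD-counted reads supporting each allele, or None if no reads."""
    total = sum(ad)
    if total == 0:
        return None
    return [count / total for count in ad]


if __name__ == "__main__":
    # Hypothetical FORMAT and sample columns, invented for this example.
    ad, dp = parse_sample_depths("GT:AD:DP:GQ:PL", "0/1:12,9:18:99:255,0,301")
    # Under the description above, sum(AD) can exceed DP because AD also
    # counts reads that were filtered out before DP was computed.
    print("AD =", ad, "sum(AD) =", sum(ad), "DP =", dp)
    print("allele fractions =", allele_fractions(ad))
```

In this invented record, the AD counts sum to 21 while the DP is 18, which under the description above would mean that three reads were counted per allele but excluded by the read filters. The allele fractions are only a descriptive summary of the AD counts; as noted, the genotype call itself is determined by the PLs.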
TO BE CONTINUED...
I am also trying to check the coverage at each position of my reference using the CoverageBySample tool (with and without the -L argument):
java -Xmx30g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -T CoverageBySample \
    -R ref.fasta \
    -I input.bam \
    -o output.cov
The output (below) is giving the right coverage but without the positions on the reference and also skipping all positions with no coverage. Is there any way to get these positions in the output file?
eo78 10
eo78 10
eo78 10
eo78 10
eo78 10
eo78 11
eo78 12
eo78 12
eo78 12