# Tagged with #coverage 1 documentation article | 0 announcements | 13 forum discussions

Created 2012-07-23 16:52:26 | Updated 2016-05-10 13:16:47 | Tags: depthofcoverage diagnosetargets coverage

### Overview

This document describes the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

The tools involved are the following:

• DepthOfCoverage
• DiagnoseTargets

For an overview of the major annotations that are used by variant callers to express read depth at a variant site, and guidelines for using those metrics to evaluate variants, please see this document.

### Introduction to coverage analysis as a QC method

Coverage analysis generally aims to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

This section is incomplete.

### Using DepthOfCoverage to QC whole-genome data

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

That last matrix is key to answering the question posed above, so we recommend running this tool on all samples together.
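To make the last matrix concrete, here is a minimal Python sketch of a single cell: the proportion of loci covered to at least X depth in at least Y samples. The input layout (a dict of per-sample depth lists, one entry per locus) and the function name are assumptions made for illustration, not the tool's own data structures.

```python
from typing import Dict, List


def proportion_covered(depths: Dict[str, List[int]], min_depth: int, min_samples: int) -> float:
    """Fraction of loci where at least `min_samples` samples reach `min_depth`."""
    # Transpose to one tuple of per-sample depths per locus.
    per_locus = list(zip(*depths.values()))
    if not per_locus:
        return 0.0
    passing = sum(
        1 for locus in per_locus
        if sum(d >= min_depth for d in locus) >= min_samples
    )
    return passing / len(per_locus)


# Hypothetical depths for three samples at four loci.
depths = {
    "s1": [20, 18, 5, 0],
    "s2": [16, 14, 30, 0],
    "s3": [15, 15, 15, 2],
}
print(proportion_covered(depths, min_depth=15, min_samples=2))
```

Because each cell depends on all samples at once, it can only be computed when the samples are profiled together.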

Note that DepthOfCoverage can be configured to output these statistics aggregated over genes by providing it with a RefSeq gene list.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,
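The rows above appear to follow the UCSC refGene table layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...). Assuming that layout, a small Python sketch for reading one row could look like this; the record type and function name are made up for the example.

```python
from typing import List, NamedTuple


class RefGeneRecord(NamedTuple):
    transcript: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    exon_starts: List[int]
    exon_ends: List[int]
    gene: str


def parse_refgene_line(line: str) -> RefGeneRecord:
    """Parse one whitespace-delimited row, assuming the UCSC refGene column order."""
    f = line.split()
    # exonStarts / exonEnds are comma-terminated lists such as "58953,".
    starts = [int(x) for x in f[9].split(",") if x]
    ends = [int(x) for x in f[10].split(",") if x]
    return RefGeneRecord(f[1], f[2], f[3], int(f[4]), int(f[5]), starts, ends, f[12])


row = "585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,"
rec = parse_refgene_line(row)
print(rec.gene, rec.chrom, rec.tx_start, rec.tx_end)
```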

For users who have access to internal Broad resources, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you do not have access (if you don't know, you probably don't have it), you can generate your own as described here.

If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     >500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is specified.
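The interval-to-gene aggregation described above can be pictured with a short Python sketch. It is only an illustration of the idea (assign each interval to the genes it overlaps, then sum), with made-up tuple layouts and names; it is not the tool's implementation.

```python
from collections import defaultdict


def aggregate_by_gene(intervals, genes):
    """Sum interval coverage per overlapping gene.

    intervals: list of (chrom, start, end, total_coverage), 1-based inclusive.
    genes:     list of (name, chrom, start, end), 1-based inclusive.
    Returns {gene: (total_coverage, average_coverage_per_base)}.
    """
    totals = defaultdict(lambda: [0, 0])  # gene -> [coverage, bases]
    for chrom, start, end, cvg in intervals:
        for name, g_chrom, g_start, g_end in genes:
            if chrom == g_chrom and start <= g_end and end >= g_start:
                totals[name][0] += cvg
                totals[name][1] += end - start + 1
    return {g: (cov, cov / bases) for g, (cov, bases) in totals.items()}


# Hypothetical gene and intervals.
genes = [("GENE_A", "chr1", 100, 200)]
intervals = [("chr1", 100, 109, 300), ("chr1", 150, 159, 500), ("chr2", 100, 109, 999)]
by_gene = aggregate_by_gene(intervals, genes)
print(by_gene)
```

A gene with no overlapping interval never appears in the result, which is consistent with the need to supply intervals alongside the gene list.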

### Using DiagnoseTargets to QC whole-exome data

DiagnoseTargets produces a pseudo-VCF file that provides a "CallableStatus" judgment for each position or range of positions in the input bam file. The possible judgments are as follows:

• PASS : The base satisfied the minimum depth for calling and had less than maxDepth coverage, so it was not flagged as EXCESSIVE_COVERAGE.

• COVERAGE_GAPS : Absolutely no coverage was observed at a locus, regardless of the filtering parameters.

• LOW_COVERAGE : There were fewer than the minimum depth of bases at the locus, after applying filters.

• EXCESSIVE_COVERAGE: More than -maxDepth reads at the locus, indicating some sort of mapping problem.

• POOR_QUALITY : More than --maxFractionOfReadsWithLowMAPQ of the reads at the locus have low mapping quality, indicating poor mapping of the reads.

• BAD_MATE : The reads are not properly mated, suggesting mapping errors.
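As a reading aid, the judgments above can be written as a small decision function. This Python sketch is an assumption-laden paraphrase of the list: the order of the checks, the parameter names and the default thresholds are all hypothetical and are not taken from the DiagnoseTargets source.

```python
def callable_status(raw_depth, filtered_depth, low_mapq_fraction, properly_mated,
                    min_depth=20, max_depth=700, max_low_mapq_fraction=0.5):
    """Return a status label for one locus (illustrative thresholds only)."""
    if raw_depth == 0:
        return "COVERAGE_GAPS"       # nothing observed, regardless of filters
    if not properly_mated:
        return "BAD_MATE"            # mates not properly paired
    if low_mapq_fraction > max_low_mapq_fraction:
        return "POOR_QUALITY"        # too many low-MAPQ reads
    if filtered_depth < min_depth:
        return "LOW_COVERAGE"        # too few bases after filtering
    if filtered_depth > max_depth:
        return "EXCESSIVE_COVERAGE"  # suspiciously deep, possible mapping problem
    return "PASS"


print(callable_status(raw_depth=50, filtered_depth=45, low_mapq_fraction=0.1, properly_mated=True))
```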


Created 2016-03-05 08:25:10 | Updated | Tags: coverage dp

Hello, I used GATK 3.4, HaplotypeCaller to perform my snp calling and this is one row in my vcf file: "Chr03 40626715 . A G 21154.18 . AC=56;AF=0.528;AN=106;BaseQRankSum=-0.302;ClippingRankSum=-1.198;DP=1055;FS=84.348;MLEAC=55;MLEAF=0.519;MQ=60.00;MQRankSum=0.019;QD=20.05;ReadPosRankSum=5.795;SOR=4.519 GT:AD:DP:GQ:PL ...... 1/1:.:.:3:43,3,0 ......."

I was confused by this result: the GT is '1/1' while both AD and DP are just '.'. What does this state mean? A SNP without coverage? I would appreciate any ideas.
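For inspecting records like this one, a small Python sketch that pairs the FORMAT keys with the sample values and reports which ones are missing may help. It only splits the fields shown in the post; it makes no claim about why the caller left AD and DP empty.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys to sample values, turning '.' into None."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return {k: (None if v == "." else v) for k, v in zip(keys, values)}


call = parse_sample("GT:AD:DP:GQ:PL", "1/1:.:.:3:43,3,0")
missing = [k for k, v in call.items() if v is None]
print(call["GT"], "GQ =", call["GQ"], "missing:", missing)
```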

Ych L.

Created 2016-03-03 11:46:10 | Updated | Tags: coverage picard calculatehsmetrics

Dear all, I have a question related to the CalculateHsMetrics PICARD tool. I am analysing data coming from an Illumina MiSeq sequencer and we enriched DNA samples using the Illumina TruSight Cardio panel. I aligned the data with BWA with default options and then I removed duplicates using the MarkDuplicates PICARD tool. I am interested in finding regions that have a mean coverage less than 20X. To be more precise, I divided the manifest regions into sub-regions of 50 bp using bedtools makewindows:

bedtools makewindows -b /storage/genomes/hg19/panels/cardio_manifest.bed -w 50 -i src | sortBed -chrThenSizeA -i - > cardio_manifest_split50.bed

I then created the interval list file using BedToIntervalList with the UNIQUE option set to false:

java -jar /usr/local/cluster/bin/picard.jar BedToIntervalList INPUT=/storage/genomes/hg19/panels/cardio_manifest_split50.bed SEQUENCE_DICTIONARY=/storage/genomes/hg19/annotation/hg19.dict OUTPUT=/storage/genomes/hg19/panels/cardioSplit50_interval.list UNIQUE=false

The command used for HsMetrics is:

java -jar /usr/local/cluster/bin/picard.jar CalculateHsMetrics BAIT_INTERVALS=/storage/genomes/hg19/panels/cardioSplit50_interval.list REFERENCE_SEQUENCE=/storage/genomes/hg19/fa/hg19.fa TARGET_INTERVALS=/storage/genomes/hg19/panels/cardioSplit50_interval.list PER_TARGET_COVERAGE=16-0162_S8_PT METRIC_ACCUMULATION_LEVEL=ALL_READS INPUT=16-0162_S8_RGSorted.bam OUTPUT=16-0162_S8_HS

The problem is that looking at the alignments in the gene HCN4 there is a portion of ~200 bp that is covered less than 20X (~16-18 reads per bp) but it is not detected by CalculateHsMetrics . To verify if the problem was related with the interval list created as explained before, I created an interval list with only two entries, the ones that are exactly in the portion that from IGV I see having a low coverage. But again the coverage evaluated from CalculateHsMetrics is higher than the one expected. I don't understand if I am doing something wrong using the tool or if I am missing something in what is the mean coverage evaluated by PICARD. Could anyone help me in this?
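One thing worth checking in a situation like this is whether a window's mean can stay above 20X while individual bases inside it fall below 20X. The Python sketch below, using made-up per-base depths, reports both the mean and the minimum per 50 bp window. It is not a description of how CalculateHsMetrics computes its values.

```python
def window_stats(depths, window=50):
    """Return (offset, mean_depth, min_depth) for consecutive fixed-size windows."""
    stats = []
    for i in range(0, len(depths), window):
        chunk = depths[i:i + window]
        stats.append((i, sum(chunk) / len(chunk), min(chunk)))
    return stats


# Hypothetical per-base depths: a uniformly low window, then a mixed one.
depths = [17] * 50 + [30] * 25 + [16] * 25
stats = window_stats(depths)
low_by_mean = [off for off, mean, _ in stats if mean < 20]
low_by_min = [off for off, _, lowest in stats if lowest < 20]
print(stats, low_by_mean, low_by_min)
```

In this toy input the second window has a mean of 23 but still contains bases at 16X, so a filter on the mean alone would not report it.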

Thank you so much!!

Stefania

Created 2016-03-02 02:32:26 | Updated 2016-03-02 02:49:03 | Tags: depthofcoverage validation coverage depth-of-coverage qualimap molecular-gender gender

In our pipeline we try to do relatively quick molecular gender validation using Depth of Coverage on chromosome 1, X and Y and compare the coverage of each, report any anomalies or move on.

For exomes this works fine once we've worked out the bias for a given capture platform.

For genomes though, when running Depth Of Coverage (3.5-0-g36282e4) we get some wildly exaggerated numbers for Y coverage. For example in a male Chr1 is ~40, X is ~20 and Y is ~40!

We think this is due to repetitive sequences on the Y chromosome being mis-handled as additional coverage. The exome .bed files limit the regions looked at so these repeats aren't seen. The genomic .bed file is just the whole Y.

It's just so consistently close to a 1:1 ratio of Chr1 to Y coverage for all our male samples.

QualiMap (v2.1 build 2015-03-19 12:05) on the other-hand does not have this issue. We get the expected coverages of ~40, ~20 and ~18 respectively.

If it is simply a repetition issue, would there be a way to correct this output?

Could it be something different since the Chr1:Y ratio is roughly 1 for every male sample we run.

Is there better way to validate molecular gender with equal or less processing time?

Thanks.

Script snippet:

trait GATK_Arguments extends CommandLineGATK {
this.reference_sequence = referenceFile
this.isIntermediate     = true
}

trait Finger_Arguments extends CommandLineGATK {
}

trait DoC_Arguments extends DepthOfCoverage {
this.omitDepthOutputAtEachBase  = true
this.omitLocusTable             = true
this.omitIntervals              = true
}

...

val depthOfACoverage = new DepthOfCoverage with GATK_Arguments with Finger_Arguments with DoC_Arguments
val depthOfXCoverage = new DepthOfCoverage with GATK_Arguments with Finger_Arguments with DoC_Arguments
val depthOfYCoverage = new DepthOfCoverage with GATK_Arguments with Finger_Arguments with DoC_Arguments

depthOfXCoverage.input_file = depthOfACoverage.input_file
depthOfYCoverage.input_file = depthOfACoverage.input_file

depthOfACoverage.intervals = Seq(capPlatforms(platform)) // .bed file for exome chip
depthOfXCoverage.intervals = depthOfACoverage.intervals
depthOfYCoverage.intervals = depthOfACoverage.intervals

depthOfACoverage.intervalsString = Seq("1")
depthOfXCoverage.intervalsString = Seq(trueXRegion) // "X:2699521-154931043"
depthOfYCoverage.intervalsString = Seq(trueYRegion) // "Y:2649521-59034050"

val DoCHead = indiPath + "/" + gvDepth

depthOfACoverage.analysisName = individual
depthOfXCoverage.analysisName = individual
depthOfYCoverage.analysisName = individual

depthOfACoverage.jobName = individual + "_IGV_DoAC"
depthOfXCoverage.jobName = individual + "_IGV_DoXC"
depthOfYCoverage.jobName = individual + "_IGV_DoYC"

add(
depthOfXCoverage,
depthOfYCoverage,
depthOfACoverage
)

/home/ClinGen/Resources/Capture_Platforms/GRCh37/Genomic.bed:

1   1   249250621   chromosome 1    0   .
...
X   1   155270560   chromosome X    0   .
Y   1   59373566    chromosome Y    0   .

Depth of Coverage for Autosomal chromosome 1 command line:

'java'  '-Xmx8192m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/home/shawarden/ClinGen/Dev/ID/.queue/tmp'  '-cp' '/home/ClinGen/bin/queue-3.5-0-g36282e4/Queue.jar'  'org.broadinstitute.gatk.engine.CommandLineGATK'  '-T' 'DepthOfCoverage'  '-I' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/ID.PrintReads.bam'  '-L' '/home/ClinGen/Resources/Capture_Platforms/GRCh37/Genomic.bed'  '-L' '1'  '-isr' 'INTERSECTION'  '-R' '/home/ClinGen/Resources/broad_bundle_b37_v2.5/human_g1k_v37.fasta'  '-o' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/gvDepthOfCoverage-A'  '-omitLocusTable'  '-omitIntervals'  '-omitBaseOutput'  

Depth of Coverage for Autosomal chromosome 1 sample_summary output:

sample_id   total       mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
ID          9212399088  40.89   45                      40              36                      99.5
Total       9212399088  40.89   N/A                     N/A             N/A

Depth of Coverage for X chromosome command line:

'java'  '-Xmx8192m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/home/shawarden/ClinGen/Dev/ID/.queue/tmp'  '-cp' '/home/ClinGen/bin/queue-3.5-0-g36282e4/Queue.jar'  'org.broadinstitute.gatk.engine.CommandLineGATK'  '-T' 'DepthOfCoverage'  '-I' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/ID.PrintReads.bam'  '-L' '/home/ClinGen/Resources/Capture_Platforms/GRCh37/Genomic.bed'  '-L' 'X:2699521-154931043'  '-isr' 'INTERSECTION'  '-R' '/home/ClinGen/Resources/broad_bundle_b37_v2.5/human_g1k_v37.fasta'  '-o' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/gvDepthOfCoverage-X'  '-omitLocusTable'  '-omitIntervals'  '-omitBaseOutput'  

Depth of Coverage for X chromosome sample_summary output:

sample_id   total       mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
ID          3050588508  20.55   24                      21              18                      87.2
Total       3050588508  20.55   N/A                     N/A             N/A

Depth of Coverage for Y chromosome command line:

'java'  '-Xmx8192m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/home/shawarden/ClinGen/Dev/ID/.queue/tmp'  '-cp' '/home/ClinGen/bin/queue-3.5-0-g36282e4/Queue.jar'  'org.broadinstitute.gatk.engine.CommandLineGATK'  '-T' 'DepthOfCoverage'  '-I' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/ID.PrintReads.bam'  '-L' '/home/ClinGen/Resources/Capture_Platforms/GRCh37/Genomic.bed'  '-L' 'Y:2649521-59034050'  '-isr' 'INTERSECTION'  '-R' '/home/ClinGen/Resources/broad_bundle_b37_v2.5/human_g1k_v37.fasta'  '-o' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/gvDepthOfCoverage-Y'  '-omitLocusTable'  '-omitIntervals'  '-omitBaseOutput' 

Depth of Coverage for Y chromosome sample_summary output:

sample_id   total       mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
ID          950208612   41.34   25                      22              18                      90.1
Total       950208612   41.34   N/A                     N/A             N/A

QualiMap Command Line:

'/home/ClinGen/bin/qualimap_v2.1/qualimap'  'bamqc'  '-bam' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/ID.PrintReads.bam'  '-outfile' 'mlQualiMap_genomic.pdf'  '-c'  '--java-mem-size=24G'  '-outformat' 'pdf'  '-outdir' '/mnt/ClinGen/TSD/GRCh37/Active/Genomic/ID/mlQualiMap_genomic'  '-gd' 'HUMAN'  '-nr 500'  '-nt' '16'

QualiMap output:

2.7. Chromosome stats
Name    Length      Mapped bases    Mean coverage   Standard deviation
1       249250621   10494149541     42.1            1,328,790.04
...
X       155270560   3488511556      22.47           3,979
Y       59373566    1085839604      18.29           184,271.52
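The summaries above allow one arithmetic check: dividing each reported total by its mean gives the number of bases the mean was taken over. The Python sketch below does that for the three DepthOfCoverage runs and compares the result with the length of the requested -L range, then recomputes the QualiMap mean for Y from its own columns. If the implied base count for Y is much smaller than the requested range, the two tools may be averaging over different numbers of positions; that is only a hypothesis to test, not a confirmed explanation.

```python
# (total, mean) copied from the DepthOfCoverage sample_summary outputs above.
doc_summary = {"1": (9212399088, 40.89), "X": (3050588508, 20.55), "Y": (950208612, 41.34)}

# Length of each requested range; chromosome 1 was requested whole.
requested = {
    "1": 249250621,
    "X": 154931043 - 2699521 + 1,
    "Y": 59034050 - 2649521 + 1,
}

implied = {c: total / mean for c, (total, mean) in doc_summary.items()}
fraction = {c: implied[c] / requested[c] for c in implied}
for chrom in doc_summary:
    print(chrom, round(implied[chrom]), f"{fraction[chrom]:.2f}")

# QualiMap reports length and mapped bases, so its mean can be recomputed directly.
qualimap_y_length, qualimap_y_mapped = 59373566, 1085839604
qualimap_y_mean = qualimap_y_mapped / qualimap_y_length
print(f"{qualimap_y_mean:.2f}")
```

With the posted numbers, the implied base count for Y is roughly 23 million, about 41% of the requested range, whereas the QualiMap mean is reproduced by dividing mapped bases by the full chromosome length.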

Created 2016-02-10 13:05:46 | Updated | Tags: depthofcoverage coverage exome next-generation-sequencing

I have tried looking for the good discussion on how to calculate the average coverage of exome sequencing after alignment. I found that depthofcoverage is a good tool to get the output, however, I am unable to understand what all the output of DepthOfCoverage means.

My Aim is to calculate the average x coverage or statistics summary of a depth of coverage of 7 samples of exome sequencing after alignment.

So for that I followed the steps:

1. Create a file called input_bam.list that lists the bam files with their paths, e.g. /home/test/Desktop/bam1.bam /home/test/Desktop/bam2.bam /home/test/Desktop/bam3.bam

2. We have bed files with region and chr, with headers chr start stop name

3. I created refgene files as well using http://genome.ucsc.edu/cgi-bin/hgTables?command=start plus for region using bed file

and sorted the file using the following command: sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt

4. After executing the following command:

java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE

I get this error:

MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt

please suggest if I should sort the file with a different command.
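The error says the gene list is not grouped by chromosome. A likely cause is that a numeric sort on column 3 treats names such as chr22 and chr18 as equal, so the rows end up ordered mainly by start position and the chromosomes interleave. As a hedged alternative, here is a Python sketch that sorts rows by a given contig order and then by start position. It assumes a tab-delimited file with the chromosome in column 3 and txStart in column 5, and that `contig_order` is taken from the reference (for example its .dict or .fai); both names are illustrative.

```python
def sort_refgene(lines, contig_order):
    """Sort refGene rows by reference contig order, then by transcript start."""
    rank = {contig: i for i, contig in enumerate(contig_order)}

    def key(line):
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[2], int(fields[4])
        # Contigs missing from contig_order go last, still grouped by name.
        return (rank.get(chrom, len(rank)), chrom, start)

    return sorted(lines, key=key)


# Hypothetical rows reproducing the interleaving from the error message.
rows = [
    "1\tNM_A\tchr22\t+\t19510547\t19512860",
    "2\tNM_B\tchr18\t+\t19993564\t19997878",
    "3\tNM_C\tchr22\t+\t22113947\t22221970",
]
sorted_rows = sort_refgene(rows, ["chr1", "chr18", "chr22"])
print([r.split("\t")[1] for r in sorted_rows])
```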

If I use the command without refgene

java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed

I get the following output files

file_base_name_withbedfile.sample_cumulative_coverage_counts
file_base_name_withbedfile.sample_cumulative_coverage_proportions
file_base_name_withbedfile.sample_interval_statistics
file_base_name_withbedfile.sample_interval_summary
file_base_name_withbedfile.sample_statistics
file_base_name_withbedfile.sample_summary

I don't understand which output file is the best one to answer my question about depth.

In the last output file, file_base_name_withbedfile.sample_summary, the output looks like:

sample_id   total       mean     granular_third_quartile granular_median granular_first_quartile %_bases_above_15
test        1162396121  1775.69  500                     500             343                     91.7
Total       1162396121  1775.69  N/A                     N/A             N/A

I don't understand what to make of it, or why there are N/A values.

In the file file_base_name_withbedfile.sample_interval_summary the output looks like the following. I don't understand what to make of it, apart from the total coverage over the 3 bam files for each location. Does that mean there are a total of 6638920 reads (or nt) in the 3 bam files (for example) at that particular location? What does the test granular Q value mean? Which column should I use to calculate the average x coverage, so that I can state that after alignment the exomes have x coverage?

Target                total_coverage  average_coverage  test_total_cvg  test_mean_cvg  test_granular_Q1  test_granular_median  test_granular_Q3  test_%_above_15
chr1:1716462-1719040  6638920         2574.22           6638920         2574.22        >500              >500                  >500              100.0
chr1:1719110-1720851  4192130         2406.50           4192130         2406.50        >500              >500                  >500              91.8
chr1:1721604-1722165  1011309         1799.48           1011309         1799.48        >500              >500                  >500              99.3
chr1:1724574-1725729  3912540         3384.55           3912540         3384.55        >500              >500                  >500              99.9
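One way to turn the interval summary into a single average is a length-weighted mean: sum total_coverage over the intervals and divide by the summed interval lengths. The Python sketch below does this for the four rows shown, assuming Target coordinates are 1-based and inclusive. It is an illustration, not an official recipe.

```python
def interval_length(target):
    """Length of 'chr:start-end' (or a single-base 'chr:pos'), inclusive."""
    span = target.split(":")[1]
    if "-" not in span:
        return 1
    start, end = (int(x) for x in span.split("-"))
    return end - start + 1


# (Target, total_coverage) copied from the sample_interval_summary rows above.
rows = [
    ("chr1:1716462-1719040", 6638920),
    ("chr1:1719110-1720851", 4192130),
    ("chr1:1721604-1722165", 1011309),
    ("chr1:1724574-1725729", 3912540),
]

total_coverage = sum(cov for _, cov in rows)
total_length = sum(interval_length(t) for t, _ in rows)
weighted_mean = total_coverage / total_length
print(total_length, round(weighted_mean, 2))
```

Dividing the first row's total_coverage by its length reproduces the reported average_coverage of 2574.22, which suggests that total_coverage counts bases rather than reads.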

If this is a redundant question, could anyone direct me to the correct discussion to understand the output.

Created 2016-01-14 15:00:51 | Updated | Tags: allelebalance variantannotator coverage alleledepth

Some calls I don't see allele depth (AD), for some I don't see depth (DP), and for some I see neither. Are there scenarios where GATK cannot annotate these?

Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals: the absolute values are lower for low coverage individuals than for high coverage ones, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals, by subtracting from the total reference length all the filtered sites and the sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (if both pairs consist of members of similar populations).
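The correction described above can be sketched in Python as follows. Genotypes are given per site as allele tuples, with None for filtered sites or missing calls (./.). The per-site difference used here (the fraction of mismatching allele pairs between the two individuals) is one assumed definition among several possible ones.

```python
def pairwise_difference(gt_a, gt_b):
    """Average per-site difference over sites where both individuals are called."""
    accessible = 0
    diff = 0.0
    for a, b in zip(gt_a, gt_b):
        if a is None or b is None:
            continue  # site is not accessible for this pair
        accessible += 1
        mismatches = sum(x != y for x in a for y in b)
        diff += mismatches / (len(a) * len(b))
    return diff / accessible if accessible else float("nan")


# Hypothetical genotypes at four sites.
ind_a = [(0, 0), (0, 1), None, (1, 1)]
ind_b = [(0, 0), (1, 1), (0, 1), None]
print(pairwise_difference(ind_a, ind_b))
```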

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Created 2014-10-29 19:19:59 | Updated | Tags: coverage mutsig mutsigcv

Hi,

I am trying to use MutSig and would like to know if there is a way/script to generate coverage tables for the experiment under analysis. I have tried to use the standard exome_full192.coverage.txt file provided by Broad Institute but I run into the following error:

silent and nonsilent rates are too different

I guess I will have to create my own coverage tables. I have searched various forums thoroughly including Biostars but so far I haven't been able to find a solution. I will very much appreciate if the folks on this forum, who have some experience with MutSig, can help me out.

Thank you so much!

Created 2013-09-27 13:49:14 | Updated | Tags: combinevariants haplotypecaller fastaalternatereference coverage

Hi, I'm calling Variants with HaplotypeCaller in a population of 2 Parents and 7 F1-individuals. After read backed phasing I'm combining the vcf files of my genotypes with CombineVariants. In the outfile I very often find "./.". I thought this means there is no coverage at a certain position. But at many positions I do have good coverage. Why do I then get ./.? Moreover I used FastaAlternateReferenceMaker and created a new reference sequence including the variants from the parents. In that case, after I run HC and do the phasing and combine variants steps, I only get "./." at positions where there is really no coverage (as I can see in my mappings). Nadia

Created 2013-07-08 07:58:11 | Updated | Tags: unifiedgenotyper coverage

Hi

I have several samples, some with a coverage of around 14, some with a coverage of around 6. I want to use UnifiedGenotyper for SNP calling but I have no clue how to set stand_call_conf (and stand_emit_conf), as it is suggested to set stand_call_conf to 30 for samples with coverage >10 and to Q4 for samples with coverage <10. So how should I proceed?

Created 2013-02-19 12:30:28 | Updated 2013-02-19 20:06:09 | Tags: coverage alleledepth

Calling SNPs using a single bam file with the command:

java -Xmx30g -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R ref.fasta \
-I input.bam \
-o output.vcf

and when looking at the output file, most DP values were equal to the AD values and in a few cases the AD value was higher. I thought that AD values are the unfiltered counts of all reads and that the DP field describes the total depth of reads that passed the UnifiedGenotyper's internal quality control. Is it normal for the AD values to be higher than the DP value?

 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  eo227

gi|218430358|emb|CU928163.2|    2317181 .       T       G       71.96   .       AC=1;AF=0.50;AN=2;BaseQRankSum=0.550;DP=10;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=19.8574;MQ=28.63;MQ0=0;MQRankSum=-1.754;QD=7.20;ReadPosRankSum=-1.754;SB=-0.01      GT:AD:DP:GQ:PL  0/1:3,4:10:87.90:102,0,88
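To find the records where the two numbers disagree, a Python sketch like the following can sum the AD values for a sample and compare them with its DP. It only reads the FORMAT and sample columns as printed above and takes no position on which reads each field counts.

```python
def ad_sum_and_dp(format_field, sample_field):
    """Return (sum of AD values, DP) for one sample column."""
    call = dict(zip(format_field.split(":"), sample_field.split(":")))
    allele_depths = [int(x) for x in call["AD"].split(",")]
    return sum(allele_depths), int(call["DP"])


ad_total, dp = ad_sum_and_dp("GT:AD:DP:GQ:PL", "0/1:3,4:10:87.90:102,0,88")
print(ad_total, dp, "AD > DP" if ad_total > dp else "AD <= DP")
```

For the record shown, the AD values sum to 7 while the sample DP is 10.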

Created 2013-02-19 12:24:36 | Updated 2013-02-19 20:07:54 | Tags: coveragebysample coverage

I am also trying to check the coverage at each position of my reference using the CoverageBySample tool (with and without the -L argument):

java -Xmx30g -jar GenomeAnalysisTK.jar \
-T CoverageBySample \
-R ref.fasta \
-I input.bam \
-o output.cov

The output (below) is giving the right coverage but without the positions on the reference and also skipping all positions with no coverage. Is there any way to get these positions in the output file?

eo78       10
eo78       10
eo78       10
eo78       10
eo78       10
eo78       11
eo78       12
eo78       12
eo78       12
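If a per-position table is available from some other output (an assumption here, since the output above carries no coordinates), the uncovered positions can be filled in afterwards. This Python sketch takes a hypothetical position-to-depth mapping and emits every position in a range, using 0 where nothing was reported.

```python
def fill_missing(covered, start, end):
    """Return (position, depth) for every position in [start, end], 0 if absent."""
    return [(pos, covered.get(pos, 0)) for pos in range(start, end + 1)]


# Hypothetical depths keyed by 1-based reference position.
covered = {101: 10, 103: 12}
filled = fill_missing(covered, 100, 104)
print(filled)
```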

Created 2012-11-05 18:49:18 | Updated 2012-11-05 19:54:41 | Tags: callableloci coverage

Dear GATK team,

I have a question regarding adding the functionality to CallableLoci to allow multiple coverage cutoffs (similar to the -ct option in DepthOfCoverage) for LOW_COVERAGE. Basically for example COVERAGE_BELOW_10X, COVERAGE_BELOW_20X etc.

These multiple statistics are important for WGS interpretation, rather than just a single LOW_COVERAGE value. At this point a separate DepthOfCoverage instance has to be run (doing the same job twice), which takes much additional time compared with a single pass of CallableLoci.
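The single-pass idea can be illustrated with a few lines of Python: walk the per-locus depths once and keep one counter per cutoff. This is a sketch of the request only, with made-up names; it is not the patch mentioned in the post and not CallableLoci code.

```python
def count_below(depths, cutoffs):
    """Count loci with depth below each cutoff in a single pass over `depths`."""
    counts = {cutoff: 0 for cutoff in cutoffs}
    for depth in depths:
        for cutoff in cutoffs:
            if depth < cutoff:
                counts[cutoff] += 1
    return counts


# Hypothetical per-locus depths and two cutoffs (10X and 20X).
below = count_below([0, 5, 9, 10, 19, 20, 35], cutoffs=(10, 20))
print(below)
```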

A simple patch to the CallableLoci code does the job, but it would be great if this can be implemented in the build as a simple command line option.

Thank you!

Created 2012-10-03 12:09:14 | Updated 2012-10-03 14:26:46 | Tags: coverage

Hi GATK Team

You are doing an amazing job, keep it up!

I apologise in advance if this question has come up and I've not found it within the forum, but I am quite new to all of this and would like to ask you a few questions regarding identifying structural variation from exome resequencing data:

I am trying to assess the best method to identify potential structural variants from a single bam file: One way of doing this proposed to me was to look at DP values (using UnifiedGenotyper) that are less than 5 and understandably there are inherent confounders in doing so. So I ran the same bam file through the DepthOfCoverage tool to focus on regions of interest which have zero coverage. However, when I overlaid the data from both and mapped their co-ordinates to the human genome, I have found that the overlap between the DP values and DoC regions was extremely small (<5%) - why could this be? Surely there should be more overlap? Are they therefore measuring different things? Have I done something wrong somewhere and I don't know it? I have tried to access the documentation for DepthOfCoverage to try and make sense of it but it seems unavailable on the website (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). Please could you advise?

Below are the command lines I've been using:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -omitBaseOutput -omitLocusTable -R referencefilename.fa -I samplefilename.bam -L regionsofinterest.txt -o outputfile.coverage

java -jar GenomeAnalysisTK.jar -R referencefilename.fa -T UnifiedGenotyper -I samplefilename.bam --dbsnp dbsnpreferencefile.vcf --genotype_likelihoods_model SNP -o outputfilename.vcf --output_mode EMIT_ALL_SITES -stand_call_conf 50.0 -stand_emit_conf 0.0  -dcov 200 -L regionsofinterest.bed