# Tagged with #depthofcoverage 2 documentation articles | 1 announcement | 61 forum discussions

Created 2014-10-17 19:19:39 | Updated 2015-07-06 13:44:00 | Tags: coveragebysample depthofcoverage depthperallelebysample ad dp

### Overview

This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.

The metrics involved are the following:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.

### Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
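To make the distinction concrete, here is a minimal sketch (plain Python, following the standard VCF FORMAT conventions; the record values are a hypothetical example) of how AD and DP are read from a genotype field, and how an allele fraction is derived from the unfiltered AD counts:

```python
# Minimal sketch: read AD and DP out of one VCF sample column.
# The FORMAT/sample strings below are a hypothetical example record.
format_keys = "GT:AD:DP:GQ:PL".split(":")
sample_field = "0/1:150,224:374:99:11098,0,5080".split(":")
info = dict(zip(format_keys, sample_field))

ad = [int(x) for x in info["AD"].split(",")]  # per-allele depths (unfiltered reads)
dp = int(info["DP"])                          # filtered depth for this sample

alt_fraction = ad[1] / sum(ad)  # fraction of (unfiltered) reads supporting ALT
print(ad, dp, round(alt_fraction, 3))
```

Because AD is unfiltered and DP is filtered, the sum of the AD values will not always equal DP for the same sample.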

TO BE CONTINUED...

Created 2012-07-23 16:52:26 | Updated 2015-07-06 13:49:14 | Tags: depthofcoverage diagnosetargets coverage

### Overview

This document describes the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

The tools involved are the following:

• DepthOfCoverage: a coverage profiler suited to whole-genome (and per-gene) coverage analysis
• DiagnoseTargets: a callability diagnostic for targeted (e.g. whole-exome) data

For an overview of the major annotations that are used by variant callers to express read depth at a variant site, and guidelines for using those metrics to evaluate variants, please see this document.

### Introduction to coverage analysis as a QC method

Coverage analysis generally aims to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?".

This section is incomplete.

### Using DepthOfCoverage to QC whole-genome data

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

That last matrix is key to answering the question posed above, so we recommend running this tool on all samples together.
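The logic behind that matrix of proportions can be sketched in a few lines. In this illustration (a toy depth table with made-up numbers, not DepthOfCoverage's actual implementation), rows are loci and columns are samples:

```python
# Sketch: the "proportion of loci covered to >= X depth in >= Y samples"
# statistic, computed from a toy depth table (rows = loci, cols = samples).
depths = [
    [20, 18, 3],   # locus 1: per-sample depth
    [25, 16, 17],  # locus 2
    [ 2,  1, 30],  # locus 3
    [19, 22, 21],  # locus 4
]

def proportion(min_depth, min_samples):
    # Count loci where at least `min_samples` samples reach `min_depth`.
    hits = sum(1 for locus in depths
               if sum(d >= min_depth for d in locus) >= min_samples)
    return hits / len(depths)

print(proportion(15, 2))  # proportion of loci with >= 15x in >= 2 samples
```

Running the tool on all samples together is what makes this cross-sample matrix possible in the first place.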

Note that DepthOfCoverage can be configured to output these statistics aggregated over genes by providing it with a RefSeq gene list.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,
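The columns in the rows above appear to follow the UCSC refGene schema: bin, transcript name, chromosome, strand, transcript start/end, coding start/end, exon count, exon starts/ends, an ID column, gene symbol, and completeness flags. A minimal sketch for parsing one of the rows above and sanity-checking its coordinates (field positions inferred from the example; this is an illustrative helper, not a GATK utility):

```python
# Sketch: parse one row of the gene list above and check basic consistency.
# Field positions are inferred from the refGene-style example rows.
row = ("585\tNM_001005484\tchr1\t+\t58953\t59871\t58953\t59871\t1\t"
       "58953,\t59871,\t0\tOR4F5\tcmpl\tcmpl\t0,")
f = row.split("\t")
tx_start, tx_end = int(f[4]), int(f[5])    # transcript bounds
cds_start, cds_end = int(f[6]), int(f[7])  # coding-region bounds
gene_name = f[12]                          # gene symbol

assert tx_start < tx_end, "transcript bounds out of order"
assert tx_start <= cds_start <= cds_end <= tx_end, "CDS outside transcript"
print(gene_name, tx_end - tx_start)
```

A check like this can catch malformed rows before they are handed to -geneList.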

For users who have access to internal Broad resources, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt

If you do not have access (if you don't know, you probably don't have it), you can generate your own as described here.

If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     >500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290

Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.
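The interval-to-gene aggregation described above can be sketched as follows. This illustrates only the referencing logic: the gene spans, intervals, and coverage totals are made up, and "falls in the gene" is simplified here to full containment:

```python
# Sketch of interval -> gene aggregation: each interval's total coverage is
# credited to the gene whose span contains it. All values are illustrative.
genes = {"GENE_A": ("chr1", 1000, 5000), "GENE_B": ("chr1", 9000, 12000)}
intervals = [  # (chrom, start, end, total_coverage)
    ("chr1", 1200, 1400, 40_000),
    ("chr1", 3000, 3500, 90_000),
    ("chr1", 9500, 9900, 30_000),
]

gene_cvg = {name: 0 for name in genes}
gene_len = {name: 0 for name in genes}
for chrom, start, end, cvg in intervals:
    for name, (g_chrom, g_start, g_end) in genes.items():
        if chrom == g_chrom and start >= g_start and end <= g_end:
            gene_cvg[name] += cvg        # accumulate coverage per gene
            gene_len[name] += end - start
            break

for name in genes:
    mean = gene_cvg[name] / gene_len[name] if gene_len[name] else 0.0
    print(name, gene_cvg[name], round(mean, 2))
```

This also shows why -geneList needs intervals: without them there is nothing to reference against the gene spans, which is why it is ignored when -omitIntervals is thrown.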

### Using DiagnoseTargets to QC whole-exome data

DiagnoseTargets produces a pseudo-VCF file that provides a "CallableStatus" judgment for each position or range of positions in the input bam file. The possible judgments are as follows:

• PASS : The locus met the minimum depth required for calling but stayed below maxDepth, so it avoids EXCESSIVE_COVERAGE.

• COVERAGE_GAPS : Absolutely no coverage was observed at the locus, regardless of the filtering parameters.

• LOW_COVERAGE : Fewer than the minimum depth of bases remained at the locus after applying filters.

• EXCESSIVE_COVERAGE : More than -maxDepth reads at the locus, indicating some sort of mapping problem.

• POOR_QUALITY : More than --maxFractionOfReadsWithLowMAPQ of the reads at the locus had low mapping quality.

• BAD_MATE : The reads are not properly mated, suggesting mapping errors.
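The judgments above amount to a small per-locus decision procedure. A rough sketch follows; the threshold values and the order of the checks here are illustrative assumptions, not GATK's actual defaults or implementation:

```python
# Rough sketch of a CallableStatus decision per locus. Thresholds and check
# order are illustrative assumptions, not DiagnoseTargets' actual behavior.
def callable_status(raw_depth, filtered_depth, low_mapq_fraction,
                    min_depth=5, max_depth=1000, max_low_mapq_fraction=0.1):
    if raw_depth == 0:
        return "COVERAGE_GAPS"        # no coverage at all, pre-filtering
    if filtered_depth > max_depth:
        return "EXCESSIVE_COVERAGE"   # suspiciously deep: likely mapping issue
    if low_mapq_fraction > max_low_mapq_fraction:
        return "POOR_QUALITY"         # too many poorly mapped reads
    if filtered_depth < min_depth:
        return "LOW_COVERAGE"         # not enough usable bases after filters
    return "PASS"

print(callable_status(raw_depth=50, filtered_depth=40, low_mapq_fraction=0.02))
```

(BAD_MATE is omitted from the sketch since it depends on pairing flags rather than depth.)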

Created 2013-03-12 18:00:30 | Updated | Tags: depthofcoverage diagnosetargets

We have decided to continue providing and supporting DepthOfCoverage and DiagnoseTargets for the foreseeable future. Going forward, we'll try to integrate them and develop their features to address the main needs of the community. To this end we welcome your continuing feedback, so please feel free to contribute comments and ideas in this thread.

To all who took the time to tell us what you find useful about DoC and DT (and what you wish it could do), a big thank you! This is always very useful to us because it helps us identify which features are most valuable to our users.

Created 2016-02-10 13:05:46 | Updated | Tags: depthofcoverage coverage exome next-generation-sequencing

I have tried looking for a good discussion of how to calculate the average coverage of exome sequencing data after alignment. I found that DepthOfCoverage is a good tool for producing this output; however, I am unable to understand what all of the DepthOfCoverage output means.

My aim is to calculate the average coverage, or a summary of depth-of-coverage statistics, for 7 exome samples after alignment.

So for that I followed the steps:

1. Create an input list file with the paths to the bam files, e.g. a file called input_bam.list containing:
   /home/test/Desktop/bam1.bam
   /home/test/Desktop/bam2.bam
   /home/test/Desktop/bam3.bam

2. We have bed files with the regions, with the columns chr, start, stop, name.

3. I created a refGene file as well, using http://genome.ucsc.edu/cgi-bin/hgTables?command=start restricted to the regions in the bed file,

and sorted the file using the following command: sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt

4. After executing the following command:

java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE

I get this error:

MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt

please suggest if I should sort the file with a different command.

If I use the command without refgene

java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed

I get the following output files

file_base_name_withbedfile.sample_cumulative_coverage_counts
file_base_name_withbedfile.sample_cumulative_coverage_proportions
file_base_name_withbedfile.sample_interval_statistics
file_base_name_withbedfile.sample_interval_summary
file_base_name_withbedfile.sample_statistics
file_base_name_withbedfile.sample_summary

I don't understand which output file is the best to answer my question of depth.

In the last output file, file_base_name_withbedfile.sample_summary, the output looks like:

sample_id  total       mean     granular_third_quartile  granular_median  granular_first_quartile  %_bases_above_15
test       1162396121  1775.69  500                      500              343                      91.7
Total      1162396121  1775.69  N/A                      N/A              N/A

I don't understand what to make of it, or why there are N/A values.

In the file file_base_name_withbedfile.sample_interval_summary, the output looks like the following. I don't understand what to make of this beyond the total coverage over the 3 bam files at each location; that would mean there are 6638920 reads (or nucleotides) across the 3 bam files (for example) at that particular location. What does the granular Q value for "test" mean? Which column should I use to state that, after alignment, the exomes have an average of x coverage?

Target                total_coverage  average_coverage  test_total_cvg  test_mean_cvg  test_granular_Q1  test_granular_median  test_granular_Q3  test_%_above_15
chr1:1716462-1719040  6638920         2574.22           6638920         2574.22        >500              >500                  >500              100.0
chr1:1719110-1720851  4192130         2406.50           4192130         2406.50        >500              >500                  >500              91.8
chr1:1721604-1722165  1011309         1799.48           1011309         1799.48        >500              >500                  >500              99.3
chr1:1724574-1725729  3912540         3384.55           3912540         3384.55        >500              >500                  >500              99.9

If this is a redundant question, could anyone direct me to the correct discussion to understand the output.
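As an aside, one way to turn the sample_interval_summary shown above into a single average-coverage figure is a length-weighted mean over intervals, using the mean-coverage column. A sketch, under the assumptions that the interval coordinates are 1-based inclusive and that the rows below (taken from the example output above) stand in for a parsed file:

```python
# Sketch: length-weighted mean coverage from sample_interval_summary rows.
# Rows below are illustrative; real files are tab-separated with a header.
rows = [
    ("chr1:1716462-1719040", 2574.22),
    ("chr1:1719110-1720851", 2406.50),
]

total_bases = 0
weighted = 0.0
for target, mean_cvg in rows:
    start, end = target.split(":")[1].split("-")
    length = int(end) - int(start) + 1  # assuming 1-based inclusive intervals
    total_bases += length
    weighted += mean_cvg * length       # weight each interval by its length

print(round(weighted / total_bases, 2))
```

A simple (unweighted) mean of the average_coverage column would over-weight short intervals, which is why the weighting matters.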

Created 2016-01-29 21:19:34 | Updated | Tags: depthofcoverage gatk

Hi,

I was comparing the output from GATK DepthOfCoverage and samtools mpileup and noticed that for chromosome 1, for instance, the depth values differ per individual:

GATK DepthOfCoverage sample (4th column is for individual of interest):

1:10001 1056 39.11 43
1:10002 1973 73.07 91
1:10003 2728 101.04 120

samtools mpileup (the last field is for the individual of interest):

1 10001 . T 0 . DP=298;I16=246,10,0,0,8031,252873,0,0,266,4956,0,0,216,3484,0,0;QS=8,0;MQSB=0.882047;MQ0F=0.855705 PL:DP:DV:SP:DPR 0,120,16:40:0:0:40,0
1 10002 . A 0 . DP=605;I16=492,13,0,0,15628,487360,0,0,583,11687,0,0,480,4056,0,0;QS=8,0;MQSB=0.82856;MQ0F=0.86281 PL:DP:DV:SP:DPR 0,244,26:81:0:0:81,0
1 10003 . A C, 0 . DP=862;I16=727,19,1,1,22640,694394,40,808,1252,27124,0,0,1102,6936,4,8;QS=7.98193,0.018075,0;VDB=0.02;SGB=-2.77779;RPB=0.485255;MQB=0.873995;MQSB=0.836835;BQB=0.0589812;MQ0F=0.845708 PL:DP:DV:SP:DPR 0,255,48,255,51,48:112:1:0:111,1,0

Any thoughts as to what could be happening?

Thanks, Alva

Created 2016-01-29 14:33:00 | Updated | Tags: depthofcoverage xhmm

Hi,

I am trying to calculate DepthOfCoverage for my samples as a prerequisite for running XHMM. When I provide a list of BAM files as the input, it works just fine. But I am trying to run the calculation on a single BAM file, and later merge this read-depth output with the calculations from 4 groups of 10 BAM files each, using the --mergeGATKdepths command from XHMM.

Does DepthOfCoverage not support single BAM files as input?

Created 2016-01-06 09:06:30 | Updated 2016-01-06 09:09:09 | Tags: depthofcoverage igv pysam

I was looking at the coverage of my bam files, and I noticed that different tools gave different results for the same file. Has anyone encountered this problem before? Am I using the tools wrong? I'm not sure how to proceed now, using gatk/pysam based on majority vote seems silly since all tools should show the same results.

I used the following methods:

## gatk

gatk -T DepthOfCoverage -R $ref -I $bamfile | head -n 11

genomeCoverageBed -d -ibam bam | head -n 10

## pysam

import pysam

def bedtools(filename):
    """simulate the behaviour of bedtools"""
    bamfile = pysam.AlignmentFile(filename, 'rb')
    for ref in bamfile.header['SQ']:
        name = ref['SN']
        pileup = bamfile.pileup()
        for pos, column in enumerate(pileup, 1):
            depth = column.nsegments
            print(name, pos, depth)
            if pos >= 10:
                break

## IGV

Manually, using mouse-over of the depth graph in the default view to see the exact read depth on the tooltip.

# Results

Position  IGV  pysam  genomeCoverageBed  gatk
1         127  89     128                89
2         130  92     131                92
3         130  92     131                92
4         133  95     134                95
5         136  98     137                98
6         137  99     138                99
7         140  102    141                102
8         141  103    142                103
9         142  104    143                104
10        146  108    147                108

Summary of results:

• Only pysam and gatk agree
• IGV seems to count 38 reads more than pysam/gatk
• genomeCoverageBed counts 37 more than pysam/gatk

Created 2015-12-21 19:25:36 | Updated | Tags: indelrealigner depthofcoverage

IndelRealigner rejects a region because there are too many reads. --maxReadsForRealignment was set to 100000 for this run. When I run the DepthOfCoverage tool, it shows that the max coverage in the rejected region is ~95000; many positions have far less, around 1000. I don't understand why this region would be rejected, since no position has a depth of coverage over the max I set with --maxReadsForRealignment. Thanks.

Created 2015-11-11 16:52:49 | Updated 2015-11-11 16:54:55 | Tags: depthofcoverage memory

Is there a way to manage DepthOfCoverage memory usage? I am having problems when I give it a large intervals file. I can successfully run other tools like RealignerTargetCreator, IndelRealigner, and BaseRecalibrator, which seem like they would be more memory-intensive. I can also run DepthOfCoverage with --omitIntervalStatistics --omitLocusTable --omitDepthOutputAtEachBase.
However, running it with just --omitDepthOutputAtEachBase gives me a memory error:

##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java

Is there any way to optimize that?

Created 2015-10-01 21:12:43 | Updated | Tags: depthofcoverage

Dear GATK team,

We recently noticed that some DepthOfCoverage output (no suffix: per-locus coverage) fails to report some locations from the BED file input. This was all the more odd because the same locations were actually reported for a different base-quality cut-off. Specifically, there are locations in the BED file that have output entries for -mbq 30 but not for -mbq 20. Irrespective of -mbq, from what I read in the documentation (no suffix: per-locus coverage), I would assume that the output contains all locations in the BED file, even if the coverage is zero. Right? Would you please clarify how the engine actually works?

Thank you, regards,
Amin Zia

Created 2015-08-07 14:25:49 | Updated | Tags: depthofcoverage haplotypecaller dp solid igv

Hello everyone! I'm using the whole GATK workflow to analyze target resequencing data coming from SOLiD platforms. I followed the Best Practices for analysis and used the proper SOLiD flags when running BaseRecalibrator (--solid_recal_mode SET_Q_ZERO_BASE_N --solid_nocall_strategy PURGE_READ). However, when looking at the VCF files after HaplotypeCaller, something does not add up. I checked some of the variants in some of my samples and found that the DP field does not report the same per-base coverage value as the bam (produced with --bamOutput for HaplotypeCaller) when viewed in IGV.
As far as I understand, there is downsampling at each position, but I'm seeing a lower DP value compared to what is stored in the BAM. I'm attaching an IGV screenshot of one of the variants for which I'm encountering this problem. I deactivated all alignment filtering options in IGV, as well as downsampling. Here's the line reported in the VCF for this variant:

chr17 45249306 rs62077265 T C 11069.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.010;ClippingRankSum=-0.616;DB;DP=375;FS=90.048;MLEAC=1;MLEAF=0.500;MQ=59.56;MQRankSum=1.319;QD=29.52;ReadPosRankSum=2.229;SOR=0.016 GT:AD:DP:GQ:PL 0/1:150,224:374:99:11098,0,5080

As you can see from the screenshot, not only do the coverages differ, but a lot of reads that map according to the reference are missing. Does anybody have an idea of what happened to the coverage inside the VCF? Thanks a lot for your time! Daniele

Created 2015-07-30 06:08:24 | Updated 2015-07-30 06:08:58 | Tags: depthofcoverage

Hi, I am using DepthOfCoverage to count base counts in my bam file, and I got information like this:

Locus                  Total_Depth  Average_Depth_sample  Depth_for_sample_name  sample_name_base_counts
1:27059142-27059298:1  482          482.00                482                    A:1 C:481 G:0 T:0 N:0
1:27059142-27059298:2  482          482.00                482                    A:482 C:0 G:0 T:0 N:0
...

And I want to convert the count numbers to percentages, such as:

1:27059142-27059298:1  482          482.00                482                    A:0.2 C:99.8 G:0 T:0 N:0
1:27059142-27059298:2  482          482.00                482                    A:100 C:0 G:0 T:0 N:0
...

Here "A:0.2" is 1/482*100, "C:99.8" is 481/482*100, and "A:100" is 482/482*100. How can I do that after DepthOfCoverage? Thanks

Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations

Hi, two questions, which relate to UnifiedGenotyper or HaplotypeCaller:

1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X, because there aren't enough reads to accurately say whether a site could be heterozygous.
Is there an option in either UnifiedGenotyper or HaplotypeCaller that excludes sites in individuals with less than 10X? Alternatively, how can these sites be removed from the vcf, or set to missing?

2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as UnifiedGenotyper or HaplotypeCaller) and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring.

Apologies if these questions are answered in the user manual. I look at it regularly but have not found much to answer my questions. Cheers, Blue

Created 2015-05-11 11:27:27 | Updated 2015-05-11 12:08:54 | Tags: depthofcoverage diagnosetargets qualifymissingintervals

I am using GATK for clinical whole-exome sequencing. I often have to answer questions when evaluating the quality of a sequencing run: 1. How well are my genes of interest covered? 2. Which exons are not well covered? 3. Which intervals are not well covered? I've tried several tools which try to address these questions (i.e. bcbio-nextgen, chanjo). Now I have a feeling that the tools mentioned in the title can more or less answer my questions, except for the nice feature of chanjo which allows storing/querying statistics across samples. My question about these tools: when to use which? From their names and the descriptions of their command line arguments, I can't answer that clearly. I tend to try all three in this order: DepthOfCoverage -> QualifyMissingIntervals -> DiagnoseTargets. So again, when to use which? Thanks, Vang

Created 2015-04-24 10:04:20 | Updated 2015-04-24 10:11:27 | Tags: depthofcoverage genenamesinterval refseq depthcoverage calculatecoverageovergenes

Hello, I've been trying to write a script for calculating coverage per gene, unsuccessfully(!), and I found now that this is nicely done by GATK! I would very much like to use this DepthOfCoverage calculation for each gene, but I cannot find the geneList needed in the format explained here.
I have a RefSeq gene list downloaded from the UCSC table browser which contains the RefSeq name, cds_start & end, and "chr" information. Is this acceptable? I want to do it for exons falling inside the genes, which I have downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets/ (phase3), so this would be my intervals list. It contains only chr info and start-end. I have also run bedtools genomecov with the "bedgraph" option on my bam files, and wrote a script to calculate the mean coverage for each exon that reads fall onto. Is the DepthOfCoverage calculation done on the same principle? Moreover, I cannot find in UCSC a table which combines the RefSeq name with the commonly used gene name. Is it combined in the geneList you provide from GATK? Can you guide me to the exact url for /humgen/.../geneList.txt, or tell me whether mine could work, and whether an exons table with only these 3 columns is ok? I am a registered member, as for writing in the forum. Is there any extra procedure needed to access your database? Thanks in advance!

Created 2015-03-24 17:08:36 | Updated | Tags: depthofcoverage

When generating the coverage metrics, how does DepthOfCoverage treat half-mapped reads (reads where only 1 arm of the mate pair mapped): are they treated like unmapped reads, or counted towards coverage for the mapped arm?

Created 2015-03-12 12:30:12 | Updated | Tags: depthofcoverage

Hello GATK team, I'm running the Best Practices pipeline, and I would like to do some coverage calculation on my samples. I want to use DepthOfCoverage, but I can't decide which bam file to use: the original one, the one after the BQSR stage (right before HC), or the bamout of HC? The point is that the first two files don't contain the reads as they were mapped in the variant calling, since HC changes the mapping dramatically. The bamout contains haplotypes, and therefore reads are merged together, and the coverage might look lower than one computed from the reads.
What do you recommend? Maya

Created 2014-12-12 18:51:18 | Updated 2014-12-12 18:53:25 | Tags: depthofcoverage

Hello, I'm trying to understand why gte_25 != %_bases_above_25:

sample_summary -> %_bases_above_25 == 88.6%
sample_cumulative_coverage_proportions -> gte_25 == 0.91

I would expect both to be identical, but they never are. They are always close, though. There is something in the calculation that eludes me. Thanks, Louis

Created 2014-12-05 17:41:36 | Updated | Tags: depthofcoverage haplotypecaller phred quality-scores minor-allele-frequency

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has -stand_call_conf and -stand_emit_conf, both of which use a "minimum phred-scaled confidence threshold". What is the difference between the base quality flag and the phred-scaled confidence threshold flags in deciding whether or not to call a variant? Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one and the same. Is there any way to specify a minimum minor-allele frequency for calling variants in HaplotypeCaller? Under the default settings, the program expects a 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio. Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

Created 2014-12-03 09:44:51 | Updated | Tags: depthofcoverage

Hi GATK team, I think I have some problem in analyzing the GATK DoC results to find the rate of intervals with coverage >=100X in my targeted sequence data. When I check the XXX.sample_interval_statistics file, using awk -F"\t" '{print $102}' XXX.sample_interval_statistics, it gave me the following result:

depth>=100 372

but when I checked the interval_summary file, I tried to do

tail -n +2 XXX.sample_interval_summary | awk -F"\t" '$3>=100' | wc -l

it gave me 387, different from the previous 372. Does that mean the third column in XXX.sample_interval_summary doesn't represent the usual "X coverage"? Did I misunderstand something here? Thank you very much. bless~ XL

Created 2014-11-27 10:13:54 | Updated | Tags: depthofcoverage

Hi, when I check the coordinates of the intervals given in the DepthOfCoverage "*.interval_summary" file, I find that it adds one to the start position of each interval. For example, if the original coordinates in my .bed interval file are 3669022-3669474, then DoC will give 3669023-3669474; only the start position changes, not the end. I am not sure whether this is caused by a "start from 0" versus "start from 1" problem, because I have no idea whether the coordinates in my original bed files start from 1 or from 0. I just worry that GATK DoC will miss one base in each interval. bless~ XL

Created 2014-11-27 09:23:12 | Updated 2014-11-27 09:43:50 | Tags: depthofcoverage error

I'm getting the following error in GATK 3.2-2:

MESSAGE: SAM/BAM file /home/dwragg/work/Analysis/TEST/AOC35_ATCACG_L008/AOC35_ATCACG_L008_bootstrap.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!

when attempting to calculate depth of coverage:

java -d64 -jar ${GATK}/GenomeAnalysisTK.jar \
  -T DepthOfCoverage \
  -R ${REF} \
  -I ${OUT}/${SAMPLE}/${SAMPLE}_bootstrap.bam \
  -o ${OUT}/${SAMPLE}/metrics/${SAMPLE}_GATKcov \
  -ct 2 -ct 5 -ct 8 \
  --omitDepthOutputAtEachBase \
  --omitIntervalStatistics \
  --omitLocusTable \
  -l FATAL

I'm assuming this is not normal, as everything was working fine in 3.1-1. Ironically, I recently switched to 3.2-2 because RealignerTargetCreator was giving me an "Unsupported major.minor version 51.0" error. I'm running Java 1.7.0-b147. [EDIT] I've since noticed that the BAM file was created with GATK version 2.4-9-g532efad.
So the 'current' and 'latest' software filing system in place here appears to have failed. I can confirm that the GATK v3.2-2 DepthOfCoverage tool generates the above program record group id error. I'll chase the administrators down to ensure the latest GATK version is installed, and start from scratch.

Created 2014-11-13 20:22:25 | Updated 2014-11-13 20:27:19 | Tags: commandlinegatk depthofcoverage

Hi, I have run the following commands using the DepthOfCoverage tool with two different bed files:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ucsc_hg19.fa -I WT_recalibrated.bam -L coverage_summary.bed -ct 1 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100 -o WT_cov

The line count for the input and output:

$ wc -l WT_cov.sample_interval_summary
4988 WT_cov.sample_interval_summary
$ wc -l coverage_summary.bed
10585 coverage_summary.bed

In the other case:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ucsc_hg19.fa -I WT_recalibrated.bam -L exon.bed -ct 1 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100 -o WT_exon

Line count for the input and output:

$ wc -l WT_exon.sample_interval_summary
5065 WT_exon.sample_interval_summary
$ wc -l exon.bed
5065 exon.bed

The input in both cases is of the standard format, as shown below:

chr1 6529578 6529755
chr1 6530273 6530442
chr1 6530543 6530721
chr1 6530773 6530980
chr1 6531028 6531730
chr1 6531768 6531914
chr1 6532563 6532713
chr1 6533023 6533273

Could anyone help to interpret the discrepancy between the number of target regions in the bed file and in the _interval_summary file in the two cases above?

Created 2014-11-10 22:31:59 | Updated | Tags: depthofcoverage

Hi all, I am using the DepthOfCoverage tool as shown below:

java -Xmx2g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ucsc_hg19.fa -geneList:REFSEQ refGene.bed -L exons.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50 -I WT_recalibrated.bam -o cov

The original exons.bed file looks like:

chr1 6526151 6527632 PLEKHG5_1 0 +
chr1 6527884 6528646 PLEKHG5_2 0 +
chr1 6529101 6529301 PLEKHG5_3 0 +
chr1 6529394 6529510 PLEKHG5_4 0 +
chr1 6529603 6529736 PLEKHG5_5 0 +
chr1 6530295 6530415 PLEKHG5_6 0 +
chr1 6530565 6530703 PLEKHG5_7 0 +
chr1 6530794 6530944 PLEKHG5_8 0 +
chr1 6531049 6531697 PLEKHG5_9 0 +

Then, the refGene.bed file is prepared based on the awk solution provided in the thread http://gatkforums.broadinstitute.org/discussion/3229/exon-output-depth-of-coverage

refGene.bed:

1 PLEKHG5_exon_1 chr1 + 6526151 6527632 6526151 6527632 1 6526151 6527632 0 PLEKHG5 exon_1 cmpl cmpl 0
2 PLEKHG5_exon_2 chr1 + 6527884 6528646 6527884 6528646 1 6527884 6528646 0 PLEKHG5 exon_2 cmpl cmpl 0
3 PLEKHG5_exon_3 chr1 + 6529101 6529301 6529101 6529301 1 6529101 6529301 0 PLEKHG5 exon_3 cmpl cmpl 0
4 PLEKHG5_exon_4 chr1 + 6529394 6529510 6529394 6529510 1 6529394 6529510 0 PLEKHG5 exon_4 cmpl cmpl 0
5 PLEKHG5_exon_5 chr1 + 6529603 6529736 6529603 6529736 1 6529603 6529736 0 PLEKHG5 exon_5 cmpl cmpl 0

However, the above command returns the error:

##### ERROR MESSAGE: Error parsing line at byte position: htsjdk.tribble.readers.LineIteratorImpl@5916c00a, for input source: refGene.bed

Could someone help to fix this?
This is not really a refGene file; it was prepared from a custom exon.bed file, which doesn't have refGene notations, using the awk solution in the above thread.

Created 2014-11-01 18:49:19 | Updated | Tags: depthofcoverage gatk-walkers

Greetings, I am trying to run a depth of coverage calculation as part of the XHMM workflow for cnv analysis. Upon running the following bash code, I am returned with this error.

java -Xmx3072m -jar ./Sting/dist/GenomeAnalysisTK.jar \
  -T DepthOfCoverage -I group1.READS.bam.list -L EXOME.interval_list \
  -R ./human_g1k_v37.fasta \
  -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \
  --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \
  --includeRefNSites \
  --countType COUNT_FRAGMENTS \
  -o group1.DATA

##### ERROR MESSAGE: Argument with name 'includeRefNSites' isn't defined.

Thanks! Steven

Created 2014-10-03 22:05:56 | Updated 2014-10-07 20:15:50 | Tags: depthofcoverage calculatecoverageovergenes

I've been learning how to use DepthOfCoverage to calculate coverage across genes. I noticed that some genes were covered by the bam, in the interval_list file, and in the geneList file, but were not reported in the gene_summary table. Most of these were non-coding RNAs, and in reviewing the geneList file, I noticed that for the non-coding RNAs, the coding region start position is 1 base higher than the coding region end position (which also puts the coding region start position higher than the transcription end position). I adjusted the file and made the coding regions identical to the transcription regions for the non-coding RNAs, and this resolved the issue for most of the genes. It appears that the remaining genes that are still not reported in the gene_summary table all overlap with exons or UTRs from other genes. My questions are:

1. Is it the case that if two exons overlap, only one will be reported on, or is something else going on?
What regions of the gene is the tool reporting on, the whole transcribed region, or just the coding region? 3. Am I safe in changing the coding regions of non-coding RNAs to equal the transcribed region for the coverage analysis? Thank you. Created 2014-10-03 19:32:28 | Updated | Tags: depthofcoverage I ran DepthOfCoverage on several samples for ~5000 genes. It turns out that a handful (~30) of these genes are overlapping (but on different strands). In these cases, it seems that DoC analyzes the first overlapping gene but not the second overlapping gene. For example, in this case (below), it will produce depth stats for the first interval, but not the second interval. You can see the end of the first interval 495620 is actually bigger than the beginning of the second interval 494949. Pv_Sal1_chr14:493199-495620 ## On (-) strand Pv_Sal1_chr14:494949-495542 ## On (+) strand Is there a way to force DoC to analyze overlapping intervals? I'm using GATK v3.2.2 with the following command: java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I bamnames.list -o coverage/test.cov -geneList coverage/test.refseq -L coverage/test.intervals -omitBaseOutput --minMappingQuality 20 --minBaseQuality 20 Thanks for any help! Created 2014-09-22 08:02:30 | Updated 2014-09-22 08:03:38 | Tags: depthofcoverage incompatible-contigs I am trying to compute mean coverage (using GATK DepthOfCovearge) for a BAM file (targeting sequencing) aligned using reference hg19. java -Xmx2g -jar GenomeAnalysisTK.jar \ -R ucsc.hg19.fasta \ -T DepthOfCoverage \ -I my_bam.list \ -L my_targets.bed \ -o coverage The problem reported is: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths: ##### ERROR contig reads = chrM / 16569 ##### ERROR contig reference = chrM / 16571. 
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM] ##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249] ##### ERROR ------------------------------------------------------------------------------------------ Could you please help me to find a solution? Many thanks in advance. 
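One way to diagnose this class of error is to dump the contig names and lengths from the BAM header (samtools view -H) and from the reference dictionary, then compare them programmatically. Here, chrM at 16569 bp is the rCRS mitochondrial sequence (the one b37 uses as MT), while ucsc.hg19 carries the older 16571 bp chrM, which suggests the BAM was aligned to a different (or mixed) hg19 build than the one given to GATK. A minimal comparison sketch:

```python
# Sketch: compare contig name -> length maps from a BAM header and a
# reference dictionary to surface "incompatible contigs" mismatches.
def find_contig_conflicts(reads, reference):
    """Both args: dict of contig name -> length (from @SQ lines)."""
    conflicts = {}
    for name, length in reads.items():
        ref_len = reference.get(name)
        if ref_len is not None and ref_len != length:
            conflicts[name] = (length, ref_len)
    return conflicts

# Values taken from the error message above.
reads_contigs = {"chr1": 249250621, "chrM": 16569}
ref_contigs = {"chr1": 249250621, "chrM": 16571}
print(find_contig_conflicts(reads_contigs, ref_contigs))
# {'chrM': (16569, 16571)}
```

The fix is to run against the exact reference the BAM was aligned to, or to realign; patching headers is not enough when the underlying sequences differ.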
Created 2014-07-21 22:34:54 | Updated | Tags: depthofcoverage

Hi Geraldine/GATKers :-)

I was trying to access the documentation for DepthOfCoverage from this page: http://gatkforums.broadinstitute.org/discussion/40/using-depthofcoverage-to-find-out-how-much-sequence-data-you-have which links to http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html But the link doesn't seem to be working. Could you possibly update it so that I can take a look at the argument details? Thank you so much as always!

Steve

Created 2014-06-27 09:45:39 | Updated 2014-06-27 09:46:05 | Tags: depthofcoverage haplotypecaller

Hi, we've been looking at the results of a recent GATK-HC (3.1-1) run using the new N1 pipeline, and we're seeing something odd in the distribution of depth of coverage (using DP from the genotype fields) for the raw unfiltered variants. All our samples were sequenced with PCR-free libraries and have two lanes of sequence (~24x mapped depth). Depth of coverage from bedtools shows a nice clean distribution (red in the graph), but the data from the HaplotypeCaller sites (blue in the graph) shows a bimodal distribution, with an excess of variants called at lower coverage (~16x) versus the most common coverage of around 24x. We've seen this in every sample we've looked at so far, so it's not a one-off. A quick look at read depth from another variant caller (Platypus) shows no evidence of this bimodal distribution in its calls. Is this expected behaviour? If so, why does it occur? If not, any idea what is going on: is it a bug in the variant caller or in the depth statistics? Do you see the same thing in other datasets? Thanks!

Created 2014-06-19 13:50:30 | Updated | Tags: depthofcoverage gvcf

Dear all,

I have a large number of gVCF files, either single samples or combined at approximately 100 samples per combined gVCF file. I would like to compute something like the average depth over a set of regions of interest from the combined or single gVCF files. I could try to get a VCF output at every position and use that to infer the percentage at a given read depth, but that seems mightily cumbersome. So my question: is there an equivalent of the DepthOfCoverage GATK module that takes gVCF files as input? Ideally it would take a combined gVCF and extract per-sample average/median/minimum depth, but otherwise I can work with single-sample gVCF data.

Thank you in advance

Created 2014-06-05 18:43:25 | Updated 2014-06-05 18:44:41 | Tags: depthofcoverage

I have a feature request: I'm a little surprised to find that the DepthOfCoverage tool still does not allow parallelization as of v3.1-1. Would it be possible to parallelize this tool and others that treat samples/BAMs as data-independent? Pretty please... :-) Thanks!

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage currently does not support parallel execution with nt. Please run your analysis without the nt option.
##### ERROR ------------------------------------------------------------------------------------------

Created 2014-05-20 09:18:54 | Updated | Tags: depthofcoverage haplotypecaller gvcf

Dear all,

I am interested in creating genome-wide gVCFs from various sources, including exome samples. The rationale is that the target region is not always completely clear, and it would be good to keep some flexibility about what I capture. Nevertheless, these single-sample gVCF files can become quite large, and most of the lines store very low-depth data (DP <= 2) that is probably not critical for downstream analysis. I would be interested in an additional parameter for the HaplotypeCaller gVCF generation process that sets as missing any region with, say, 2 reads or fewer. That would considerably reduce storage at little cost in terms of discarded information. Does such a parameter exist already? Would it make sense to add it? I am not sure, but right now it seems to me it might be useful.

Vincent

Created 2014-04-18 19:01:29 | Updated 2014-04-18 19:02:25 | Tags: depthofcoverage

I am using DepthOfCoverage for the first time. I used the following command to get the coverage, and I am getting only one output file, which gives coverage per sample. I am giving a gene list in REFSEQ format along with a sorted interval BED file, yet I am not getting any file with per-gene coverage. Could anyone clarify whether I am missing something in the arguments?

java -Xmx8g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage --calculateCoverageOverGenes:REFSEQ Genes_refgene.txt --outputFormat table -o Coverage_summary -I BAM.list -L genes.bed

Thanks,
Tinu

Created 2014-03-13 08:54:14 | Updated | Tags: depthofcoverage

Hi,

I tried to use DepthOfCoverage to figure out the coverage of 50 whole-exome sequencing datasets. It works fine for a single sample; however, the program always complains about memory, even when I provide 100 GB via "-Xmx100g". Any suggestions? There is an option "--read_buffer_size" which seems to be helpful; how should I set its value?

Created 2013-12-18 21:52:37 | Updated | Tags: documentation depthofcoverage parallelism

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html states (bold is my emphasis):

Parallelism options: This tool can be run in multi-threaded mode using this option. TreeReducible (-nt)

Yet when I run the GATK with -nt, I get an error that -nt is not supported (truncated to save space):

ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a):
ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option.

Created 2013-12-10 03:05:44 | Updated 2013-12-10 16:18:36 | Tags: depthofcoverage

I have used the DepthOfCoverage tool to compute the depth of my BAM file. My script is as below:

java -Xmx4g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T DepthOfCoverage -I IonXpress010.bam -o my.depth

But I'm confused about the result in the *depth file. For example:

> my.depth
Locus Total_Depth Average_Depth_sample Depth_for_none
13:32890500 0 0.00 0
13:32890501 30 30.00 30

But when I query my *.bam file with the command below, the result is 7, which is not the same as the 0 in my.depth:

samtools view *.bam | awk '$3==13 && $4<=32890500 && $4+length($10)>=32890500' - | wc
7

Can anybody explain this and tell me how the tool works? Thanks a lot. (Sorry for my poor English.)
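Two effects likely explain the disagreement. First, the awk test uses length($10) as the read's footprint on the reference, but a read's reference span is defined by its CIGAR: soft clips and insertions consume read bases without consuming reference bases, so POS + length(SEQ) can overshoot the true end. Second, DepthOfCoverage drops reads failing its default read filters (duplicates, unmapped reads, and so on) before counting. A CIGAR-aware span looks like this:

```python
# Sketch: compute how many reference bases a read's CIGAR actually
# consumes (M/D/N/=/X operations), which is what coverage should use
# instead of the read's sequence length.
import re

def reference_span(cigar):
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")

print(reference_span("101M"))    # 101
print(reference_span("20S81M"))  # 81: soft-clipped bases cover nothing
```

So a read whose naive POS + length(SEQ) reaches 32890500 may in fact end before it once clipping is accounted for.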

Created 2013-11-22 09:36:51 | Updated | Tags: depthofcoverage gatk

I am using this command to calculate the depth of coverage with three different formats for the EXOME_interval.list: java -Xmx2048m -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I /home/vsinha/software/bam/group1.READS.bam.list -L /home/vsinha/software/EXOME.interval_list -R /home/vsinha/software/human_g1k_v37.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /home/vsinha/software/group1.DATA

1st format:

1 69114 69332
1 69383 69577
1 69644 69940
1 69951 70028
1 566170 566275
1 801895 801983
1 802332 802416
1 861272 861420

Error : ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

2nd format:

1:69114-69332
1:69383-69577
1:69644-69940
1:69951-70028
1:566170-566275
1:801895-801983
1:802332-802416
1:861272-861420

Error: Badly formed genome loc: Contig '1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

3rd format:

chr1:69114-69332
chr1:69383-69577
chr1:69644-69940
chr1:69951-70028
chr1:566170-566275
chr1:801895-801983
chr1:802332-802416
chr1:861272-861420

Error: Contig chr1 not present in sequence dictionary for merged BAM header: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

Can you please let me know what the problem is with my file?

Regards Vishal
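On the third format: the merged BAM header in the error shows b37-style contig names (1..22, X, Y, MT), so chr-prefixed intervals can never match; the intervals (and the reference) must use the same naming as the BAM. A hypothetical renaming helper for converting UCSC-style intervals to b37 names (the chrM-to-MT special case is an assumption that the target is the b37 build):

```python
# Sketch: rename UCSC-style interval strings (chr1, chrM) to b37-style
# names (1, MT), matching the contigs listed in the BAM header above.
def ucsc_to_b37(interval):
    contig, _, span = interval.partition(":")
    if contig == "chrM":
        contig = "MT"
    elif contig.startswith("chr"):
        contig = contig[3:]
    return f"{contig}:{span}" if span else contig

print(ucsc_to_b37("chr1:69114-69332"))  # 1:69114-69332
print(ucsc_to_b37("chrM:1-16569"))      # MT:1-16569
```

For the first format, whitespace-separated coordinate lines are BED-style and are generally only parsed as such when the file has a .bed extension; as a bare interval list, each whole line is read as one contig name, which matches the first error message.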

Created 2013-11-14 13:42:33 | Updated | Tags: depthofcoverage

Are secondary alignments and duplicates taken into consideration by DepthOfCoverage? How can you skip secondary alignments for coverage estimation?
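Both are marked by SAM FLAG bits (0x100 for secondary alignments, 0x400 for duplicates), and the GATK engine applies NotPrimaryAlignmentFilter and DuplicateReadFilter by default for most walkers, so DepthOfCoverage should already be skipping them; check the read filter list in the tool docs for your version to be sure. A sketch of the flag test:

```python
# Sketch: decide whether a read would be counted, given that secondary
# alignments (FLAG bit 0x100) and duplicates (0x400) are filtered out.
SECONDARY = 0x100
DUPLICATE = 0x400

def is_counted(flag):
    return not (flag & SECONDARY or flag & DUPLICATE)

print(is_counted(99))    # True: a plain properly-paired read
print(is_counted(355))   # False: 99 + 0x100, a secondary alignment
print(is_counted(1123))  # False: 99 + 0x400, a marked duplicate
```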

Created 2013-11-13 19:28:47 | Updated | Tags: depthofcoverage

Hi,

I'm trying to calculate the coverage for an exome sequencing project.

My code is:

java -jar $gatk_jar \
-T DepthOfCoverage \
-R $ref_genome \
-I "$sample_name".recal_reads.bam \
-o $result_dir/$sample_name/coverage/"$sample_name" \
-geneList $ref_refseq \
-L test.bed \
-ct 10 -ct 20 -ct 30

but I get an ERROR that I can't solve:

##### ERROR MESSAGE: Unknown file is malformed: Could not parse location from line: chr1 66999814 67000061 NM_032291_exon_0_10_chr1_66999825_f 0 +

This error relates to my refseq file, but I can't find the problem with it. This is the first line of my geneList, so something is messed up right away. I created it following http://www.broadinstitute.org/gatk/guide/article?id=1329 and used the script sortByRef.pl to sort it by the fai file from your bundle. Can you give me any advice on what to check? Thank you very much for helping!

Created 2013-10-23 17:48:44 | Updated | Tags: depthofcoverage

I used the following command to get coverage data for the entire genome:

java -Xmx2g -jar ~/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/Gmax_189.fa -o DoCov_dedup -I dedup.bam.list

Coverage was not reported for 36025109 bases. Many of the skipped bases fall one after another in the genome, forming big continuous regions. I do not understand why these regions were skipped: if there was no mapping, the coverage should normally be reported as 0. I'd appreciate some insight into this issue! Thank you.

Created 2013-10-15 18:45:52 | Updated | Tags: depthofcoverage

Hello,

I'm trying to calculate depth of coverage over entire contigs for multiple samples. I ran the following command:

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R Ecoli/Ecoli.allSubTypes.fasta \
-T DepthOfCoverage \
-o ../Ecoli/all.tmpOut \
-I Ecoli/bamlist.list \
-geneList Ecoli/Ecoli.refSeq

where I tried to generate a refSeq file with one line per contig. I was expecting the output to be a matrix with the various contigs as rows and the samples as columns.
Instead I got a file like this, where each base is a row:

Locus Total_Depth Average_Depth_sample Depth_for_sample1
gi|312944605|gb|CP001855.1|:1 0 0.00 0
gi|312944605|gb|CP001855.1|:2 0 0.00 0
gi|312944605|gb|CP001855.1|:3 0 0.00 0
gi|312944605|gb|CP001855.1|:4 0 0.00 0
gi|312944605|gb|CP001855.1|:5 0 0.00 0
gi|312944605|gb|CP001855.1|:6 0 0.00 0
gi|312944605|gb|CP001855.1|:7 0 0.00 0
gi|312944605|gb|CP001855.1|:8 0 0.00 0
gi|312944605|gb|CP001855.1|:9 0 0.00 0

What am I doing wrong? Thanks!

Moran

Created 2013-10-04 14:01:32 | Updated | Tags: depthofcoverage gatk

Hi there! I tried to analyse the depth of coverage of my exome data using GATK DepthOfCoverage:

java -Xmx${heap}m \
-Djava.io.tmpdir=$processed_bam/tmp_folder_dept \
-jar $gatk \
-T DepthOfCoverage \
-omitBaseOutput \
-omitLocusTable \
-R $GRCh37_ref_WholeGenome \
-I $file \
-o $coverage/${samplename}.coverage.dept

I am not interested in particular genomic regions, so I don't have a target list file. How were the intervals determined in the "_interval_statistics" output file? Where can I obtain an interval target list file including all exons in the human genome? Thank you very much!

Fulvio

Created 2013-09-26 08:51:52 | Updated | Tags: depthofcoverage gatk ensembl hgnc list

Hello,

I'm currently using DepthOfCoverage with the -L option and a BED intervals file. I would like to get a gene report as well, but I'm not familiar with the RefSeq format for the gene list.

I work with HGNC and Ensembl genes. I could get a gene list file with the following format:

HGNC chr start end

But this doesn't seem to work. Any suggestions on how I could achieve that or how I could generate a valid gene list file with my current intervals?

Thank you.

Created 2013-09-25 20:06:45 | Updated | Tags: depthofcoverage

Does the walker DepthOfCoverage support multi-threading?

According to the documentation: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html "This tool can be run in multi-threaded mode using this option. TreeReducible (-nt)"

When I try with GATK2.5 I get this error message: "Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option."

Would it work if I upgraded to 2.7? You upgrade GATK more frequently than I change my underwear, so it is difficult to keep up :) Thanks!

Created 2013-09-18 05:08:04 | Updated | Tags: depthofcoverage notprimaryalignmentfilter

Can someone point me to documentation on how I can run DepthOfCoverage without the NotPrimaryAlignmentFilter?

Created 2013-09-12 11:18:16 | Updated | Tags: depthofcoverage

Hello,

I'm running DoC on my exome data and the output is hard to connect. This is the command I'm using:  java -Xmx2g -jar ~/programs/GenomeAnalysisTK-2.2-2-gf44cc4e/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/dependencies/reference/hg19.fasta -o doc -I doctest.bam -geneList:REFSEQ ../refSeqGenes.txt -L ~/dependencies/bedfiles/coverage/500genes.exons.refGene.sort.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50

The refseq file looks like this:

75 NM_003036 chr1 + 2160133 2241652 2160205 2238204 7 2160133,2234416,2234723,2235278,2235731,2237458,2238015, 2161174,2234542,2234839,2235541,2236024,2237689,2241652, 0 SKI cmpl cmpl 0,0,0,2,1,0,0,
637 NM_001195563 chr1 + 6845383 6932107 6845590 6931825 4 6845383,6880240,6885151,6931816, 6845635,6880310,6885270,6932107, 0 CAMTA1 cmpl cmpl 0,0,1,0,
676 NM_000302 chr1 + 11994723 12035599 11994836 12034865 19 11994723,12008032,12009829,12010413,12012679,12014886,12016973,12017898,12018572,12020702,12023588,12024231,12024700,12025536,12026307,12027043,12030726,12032928,12034709, 11994912,12008124,12009963,12010577,12012792,12014950,12017071,12018000,12018704,12020824,12023693,12024357,12024842,12025650,12026373,12027148,12030873,12033054,12035599, 0 PLOD1 cmpl cmpl 0,1,0,2,1,0,1,0,0,0,2,2,2,0,0,0,0,0,0,

The exon bed file looks like this:

chr1 2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f 0 +
chr1 2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f 0 +
chr1 2234723 2234839 NM_003036_cds_2_0_chr1_2234724_f 0 +
chr1 2235278 2235541 NM_003036_cds_3_0_chr1_2235279_f 0 +
chr1 2235731 2236024 NM_003036_cds_4_0_chr1_2235732_f 0 +
chr1 2237458 2237689 NM_003036_cds_5_0_chr1_2237459_f 0 +
chr1 2238015 2238204 NM_003036_cds_6_0_chr1_2238016_f 0 +
chr1 6845590 6845635 NM_001195563_cds_0_0_chr1_6845591_f 0 +
chr1 6845590 6845635 NM_015215_cds_0_0_chr1_6845591_f 0 +
chr1 6880240 6880310 NM_001195563_cds_1_0_chr1_6880241_f 0 +
chr1 6880240 6880310 NM_015215_cds_1_0_chr1_6880241_f 0 +
chr1 6885151 6885270 NM_001195563_cds_2_0_chr1_6885152_f 0 +
chr1 6885151 6885270 NM_015215_cds_2_0_chr1_6885152_f 0 +
chr1 6931816 6931825 NM_001195563_cds_3_0_chr1_6931817_f 0 +
chr1 7151363 7151431 NM_015215_cds_3_0_chr1_7151364_f 0 +

The gene_summary file shows coverage for all genes:

SKI 73147 33.45 73147 33.45 21 32 45 100.0 99.8 93.4 75.8 16.6
CAMTA1 8158 33.57 8158 33.57 25 33 47 100.0 100.0 100.0 81.5 17.3
UNKNOWN 5312955 54.97 5312955 54.97 32 49 74 99.7 98.2 96.2 87.9 48.3
PLOD1 85711 39.24 85711 39.24 26 34 48 100.0 99.5 97.3 84.8 22.8
HSPG2 407112 30.90 407112 30.90 19 29 41 99.5 98.0 91.8 72.3 13.6

and the interval_summary file shows all exons:

chr1:2160206-2161174 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4
chr1:2234417-2234542 7165 56.87 7165 56.87 41 60 71 100.0 100.0 100.0 100.0 62.7
chr1:2234724-2234839 4946 42.64 4946 42.64 42 43 45 100.0 100.0 100.0 100.0 0.0
chr1:2235279-2235541 9542 36.28 9542 36.28 30 38 46 100.0 100.0 100.0 100.0 10.3
chr1:2235732-2236024 7763 26.49 7763 26.49 24 28 32 100.0 100.0 100.0 90.8 0.0
chr1:2237459-2237689 9711 42.04 9711 42.04 34 49 53 100.0 100.0 100.0 88.3 42.9
chr1:2238016-2238204 2081 11.01 2081 11.01 9 13 15 100.0 97.9 57.7 0.0 0.0
chr1:6845591-6845635 655 14.56 655 14.56 14 16 17 100.0 100.0 100.0 0.0 0.0

This information is used to verify that a given variant was well covered, not only at the position of the variant but also over the rest of the gene it is in. That is hard to see in the interval_summary file because the exon and gene names are not shown there, and I can't see it in the gene_summary file because it lacks per-exon information. This means cross-referencing the bed file and the interval_summary with ugly scripting, or creating a spoofed refseq file per exon. Is there an option to format the output so that the coverage of each exon is shown together with the exon and/or gene name (from the info in the bed or refseq file)?

e.g. something like this:  chr1:2160206-2161174 NM_003036_cds_0_0_chr1_2160206_f SKI 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4

Regards,

Martin
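Until such an output option exists, the cross-referencing does not have to be too ugly: the interval_summary key can be computed directly from each BED line, remembering that BED is 0-based half-open while the summary prints 1-based inclusive intervals (hence the start+1 below). A sketch:

```python
# Sketch: build an index from interval_summary-style keys to BED exon
# names, so summary rows can be annotated with exon/gene identifiers.
def bed_name_index(bed_lines):
    index = {}
    for line in bed_lines:
        chrom, start, end, name = line.split()[:4]
        # BED start is 0-based; the summary interval is 1-based inclusive
        index[f"{chrom}:{int(start) + 1}-{end}"] = name
    return index

bed = ["chr1 2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f 0 +"]
idx = bed_name_index(bed)
print(idx["chr1:2160206-2161174"])  # NM_003036_cds_0_0_chr1_2160206_f
```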

Created 2013-08-29 01:11:12 | Updated | Tags: depthofcoverage haplotypecaller downsampling

Hello!

Trying to downsample in an orderly fashion in the name of experimentation, and in doing so would like to specify just one chromosome for the experiment - so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up:

##### ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases.

I was hoping to poke through results from using the HaplotypeCaller with many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate, though I was hoping to find out what level of coverage led to what level of results, and using -dfrac feels much less specific: it appears to toss a fraction of however many reads were at a given position, rather than tossing reads over a certain coverage. Thus with -dfrac I could say that my sample had an average of 30x for this chromosome and I tossed half, so theoretically I've simulated 15x depth of coverage...

Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction?

Thanks for any help/discussion! -Tristan
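One way to reason about the difference (as a model, not a claim about GATK's exact implementation): -dfrac keeps each read independently with some probability f, so depth at every locus is thinned roughly binomially, variance included, while -dcov only truncates loci above the cap and leaves everything else untouched. That makes -dfrac the closer analogue of actually sequencing less:

```python
# Sketch: compare the two downsampling modes on a toy depth profile.
def expected_depth_dfrac(depths, f):
    # expectation of binomial thinning: each read kept with prob. f
    return [d * f for d in depths]

def depth_after_dcov(depths, cap):
    # per-locus cap: only loci above the cap are affected
    return [min(d, cap) for d in depths]

depths = [10, 30, 60]
print(expected_depth_dfrac(depths, 0.5))  # [5.0, 15.0, 30.0]
print(depth_after_dcov(depths, 30))       # [10, 30, 30]
```

Note that -dcov distorts the shape of the depth distribution (it flattens the high tail), whereas -dfrac scales the whole distribution down, which is what a genuinely lower-coverage run would look like.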

Created 2013-08-11 15:39:42 | Updated | Tags: depthofcoverage gatk

I'm a bit confused about some of the output of GATK's DepthOfCoverage script. In the *.GATK.covdepth.sample_summary, what does the "total" column signify? Is it the total number of base pairs in the data? My raw read data has ~37.2 billion bp, yet GATK reports over 92 billion. Does 'total' mean something else, or will I have to rerun the DepthOfCoverage script?
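As far as I understand the output, 'total' is the per-locus depths summed over all analyzed loci (the mean column is total divided by the locus count), so it is not the raw base count; it can also exceed the aligned base count when loci are visited more than once, for example when -L intervals overlap. A toy illustration (all numbers made up):

```python
# Sketch: "total" as the sum of per-locus depths; the mean column is
# total / number of loci, and overlapping intervals double-count loci.
def summary_total(depths):
    return sum(depths)

depths = [10, 20, 30]
total = summary_total(depths)
print(total, total / len(depths))  # 60 20.0

# the same loci listed under two overlapping intervals count twice:
print(summary_total(depths + depths))  # 120
```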

Created 2013-07-22 11:09:51 | Updated | Tags: depthofcoverage snp vcf intervals non-human filter

Hi team,

This is two separate questions:

1. Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high or very low? e.g. 5% on either side?

2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do it earlier in the GATK process. I've tried -L 2L -L 2R -L 3L etc., -L 2L 2R 3L etc., and -L 2L, 2R, 3R etc., but the result is either an input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.

Cheers,

Blue
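On the second question: repeating -L once per interval is the supported form, and an intervals file with one contig per line passed to a single -L is equivalent; space- or comma-separated lists after one -L are not. A sketch of building both forms (contig names taken from the post):

```python
# Sketch: two equivalent ways to restrict GATK to several contigs:
# repeat -L per contig, or write a one-contig-per-line intervals file.
contigs = ["2L", "2R", "3L", "3R", "X"]

args = []
for c in contigs:
    args += ["-L", c]
print(" ".join(args))  # -L 2L -L 2R -L 3L -L 3R -L X

intervals_file = "\n".join(contigs)  # contents for e.g. chroms.list
print(intervals_file.splitlines()[0])  # 2L
```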

Created 2013-06-26 17:38:55 | Updated 2013-06-26 17:42:50 | Tags: depthofcoverage gatkreport basecoveragedistribution

In the output grp file,

#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
0  2859049   2932784
1   856997    837791
2   288587    276253
3    95618     91703

What is the meaning of the three columns?

Thanks,

Created 2013-06-24 15:58:07 | Updated 2013-06-24 15:59:43 | Tags: depthofcoverage refseqcodec exome

Hi, I am trying to run the DepthOfCoverage walker from GATK for the first time and I am running into stumbling blocks. I am interested in getting coverage info for genes, so I downloaded the RefSeq gene list and am filtering and sorting it according to the specifications, using the SortByRef.pl script provided on the website. However, I am getting an error saying:

Can not open temporary file 1045: Too many open files at SortByRef.pl line 68, <$INPUT> line 1021.

I also checked the soft/hard limits on the system; my current limits are the following:

ulimit -u :386246
ulimit -S -u: 386246  

I am stuck at this point and don't know how to proceed.

Any help is much appreciated.

Thanks
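A side note on the limits: "Too many open files" is governed by the open-descriptor limit (ulimit -n), not the max-user-processes limit (ulimit -u) checked above. For a refGene-sized file it can also be simpler to skip the temporary files entirely and sort in memory against the .fai contig order; a sketch (the contig column index is an assumption about the refGene layout, adjust to your file):

```python
# Sketch: sort refGene rows by reference contig order (as listed in
# the .fai), entirely in memory, avoiding SortByRef.pl's temp files.
def sort_by_ref(rows, fai_contigs, contig_col=2):
    order = {name: i for i, name in enumerate(fai_contigs)}
    return sorted(rows, key=lambda r: order[r.split("\t")[contig_col]])

rows = ["1\tNM_b\tchr2\t+", "2\tNM_a\tchr1\t+"]
print(sort_by_ref(rows, ["chr1", "chr2"]))
# ['2\tNM_a\tchr1\t+', '1\tNM_b\tchr2\t+']
```

A stable in-memory sort like this also preserves the within-contig order of the input rows.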

Created 2013-05-16 15:37:00 | Updated 2013-05-16 15:37:50 | Tags: depthofcoverage bam gatk

Hello,

I have spent many hours trying to run GATK DepthOfCoverage but it never works. Last try: -T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50

My BAM header seems to be malformed.

ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

Here is the first line of the header, followed by the first reads:

@SQ SN:Chloroplastgenomebarley LN:136462 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

I have searched the forum and docs about it. I have tried to reorder my header with Picard:

@HD VN:1.4 SO:unsorted @SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

but nothing changed.

Can someone help me, please?

Regards, Remi
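The reordering did not help because the complaint is about read groups, not sequence order: GATK wants at least one @RG line in the header plus an RG:Z: tag on every read, which Picard's AddOrReplaceReadGroups can add in one pass. The sketch below only illustrates what the header gains; the ID/SM/PL/LB/PU values are placeholders to replace with real ones:

```python
# Sketch: what adding a read group does to a SAM header. All the tag
# values here are hypothetical placeholders, not the poster's data.
def add_read_group(header, rg_id="ERR125905", sample="barley1",
                   platform="ILLUMINA", library="lib1", unit="unit1"):
    rg = (f"@RG\tID:{rg_id}\tSM:{sample}\tPL:{platform}"
          f"\tLB:{library}\tPU:{unit}")
    return header + "\n" + rg

header = "@SQ\tSN:Chloroplastgenomebarley\tLN:136462"
print(add_read_group(header))
```

Each read record additionally needs an RG:Z:<ID> tag matching the @RG line, which is why a dedicated tool is the practical route rather than header editing.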



Created 2013-05-15 09:46:39 | Updated 2013-05-15 09:47:04 | Tags: depthofcoverage intervals

Hi there,

this is my interval_list

chr1 762095 762275 LINC00115|NR_024321
chr1 762280 762414 LINC00115|NR_024321
chr1 762420 762565 LINC00115|NR024321
chr1 777259 777349 LOC643837
chr1 777391 777481 LOC643837
chr1 777482 777642 LOC643837_
chr1 783061 783151 LOC643837
chr1 792270 792446 LOC643837
chr1 861266 861496 NM_152486|SAMD11
chr1 865582 865787 NM_152486|SAMD11
chr1 866331 866507 NM_152486|SAMD11

and this is the output from the sample_interval_summary

chr1:762095-762275 ...
chr1:762280-762414 ...
chr1:762420-762565 ...
chr1:777259-777349 ...
chr1:783061-783151 ...
chr1:792270-792446 ...
chr1:861266-861496 ...
chr1:865582-865787 ...
chr1:866331-866507 ...

Why am I missing two exons (chr1:777391-777481 and chr1:777482-777642)?

this is my cmd:

java -Xmx32g -jar /local/apps/gatk/2.5-2-gf57256b/GenomeAnalysisTK.jar -I sample.bam -R .../genome.fa -T DepthOfCoverage -o jtn -geneList hg19.tsv -L exons.list --omitDepthOutputAtEachBase --includeDeletions --interval_merging OVERLAPPING_ONLY -l INFO

/M
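Worth checking whether the missing exons touch a neighbor: interval-merging behavior (see --interval_merging) collapses overlapping, and by default also abutting, intervals, which can make listed exons disappear from the per-interval summary. Here the two missing exons are exactly an abutting pair (777481 + 1 == 777482). A sketch that flags such pairs (coordinates from the post, treated as 1-based inclusive):

```python
# Sketch: find interval pairs that overlap or abut (end + 1 == next
# start), the cases subject to GATK's interval merging.
def touching_pairs(intervals):
    """intervals: list of (chrom, start, end), 1-based inclusive."""
    hits = []
    s = sorted(intervals)
    for a, b in zip(s, s[1:]):
        if a[0] == b[0] and b[1] <= a[2] + 1:
            hits.append((a, b))
    return hits

ivs = [("chr1", 777391, 777481), ("chr1", 777482, 777642),
       ("chr1", 783061, 783151)]
print(touching_pairs(ivs))
# [(('chr1', 777391, 777481), ('chr1', 777482, 777642))]
```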

Created 2013-04-16 15:17:04 | Updated 2013-04-16 15:17:32 | Tags: depthofcoverage

I am running GATK DepthOfCoverage on a bunch of samples (sequenced using Illumina MiSeq). For one of the samples, I am getting the following error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{<file.bam>} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 61; please see the GATK --help documentation for options related to this error

I found a suggestion that I should be using fixMisencodedQuals flag in this case. However, if I do that, "the engine will simply subtract 31 from every quality score as it is read in". That will fix this error, but then all my quality scores will be incorrect.

Furthermore, the BAM file I am using was last recalibrated with GATK. Why is GATK producing files with invalid quality scores?

What should I do?

Any help would be much appreciated.
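A score of 61 under the Phred+33 assumption is the signature of Phred+64 (Illumina 1.3-1.7) qualities, and subtracting 31 is exactly the Phred+64-to-Phred+33 shift, so --fixMisencodedQuals yields the intended scores rather than corrupting them; the real question is why this one sample came out Phred+64 when the others did not (worth checking its FASTQ provenance). The conversion itself:

```python
# Sketch: re-encode Phred+64 quality characters as Phred+33. The
# numeric quality is unchanged: chr(q + 64) becomes chr(q + 33).
def phred64_to_phred33(quals):
    return "".join(chr(ord(c) - 31) for c in quals)

print(phred64_to_phred33("hhhh"))  # IIII: Q40 in both encodings
```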

Created 2013-04-06 21:52:36 | Updated | Tags: depthofcoverage

Hello, I'm trying to use DepthOfCoverage to check the coverage of my reads. I already have the indexed BAM files (created with SAM-to-BAM) and also reordered them (Reorder SAM/BAM), but they are still not recognized by DepthOfCoverage, and I get this error message:

"Sequences are not currently available for the specified build"

I used "human (homo sapiens) hg19 full" for mapping but I can't select it; it only allows the b37 version.

Any suggestions?

Thank you very much in advance

Gema

Created 2013-03-18 10:56:54 | Updated | Tags: depthofcoverage refseqcodec website

Dear team,

I would like to include a RefSeq ROD file in order to get the coverage per gene using the GATK coverage tool (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). However, it is not really clear to me how to generate such a file, since I cannot find the right documentation. All the links that should point to this information seem to be (incorrectly?) redirected to the main GATK homepage. For example, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html has a link that points to http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq which is redirected to http://www.broadinstitute.org/gatk/. Can someone point me to the correct docs? Other resources I found also point to the same wiki, which I can't find at the moment.

Kind regards, JJ

Created 2013-03-11 11:12:22 | Updated 2013-03-11 12:09:43 | Tags: unifiedgenotyper depthofcoverage annotation

Just in the process of updating our pipeline from v2.3-4-gb8f1308 Lite to v2.4-7-g5e89f01 Academic and have run into a small issue. The command line:

-T UnifiedGenotyper -glm SNP -R /lustre/scratch111/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -I /lustre/scratch111/projects/helic/vcf-newpipe/lists/chr1-pooled.list --alleles /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.snps.vcf.gz -L 1:1-1000000 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A DepthOfCoverage -o /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.asnps.vcf.gz.part.vcf.gz

This worked in 2.3.4. But now gives:

Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly

I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac)

Created 2013-03-05 15:52:53 | Updated 2013-03-05 15:56:27 | Tags: depthofcoverage diagnosetargets

Now that DepthOfCoverage is being retired in GATK 2.4, I decided to investigate DiagnoseTargets. I have some questions.

1) Are there plans to support output formats other than VCF? What was great about DOC is that I could easily send the output to anyone and it could be read easily. With DT, that requires additional processing.

2) DOC provided summaries for intervals as well as samples. DT only does intervals. Is there a way to get per-sample info?

3) DOC output full intervals; DT only outputs the start positions, and getting the end positions requires an additional step. Can that be adjusted?

4) DOC provided coverage info for all intervals. DT only shows covered intervals, so if an interval is not covered, it will not be listed in the output. Is there a way to output all intervals?

Sorry if I sound too critical. I am a big fan of DepthOfCoverage and am disappointed to see it go.

Thank you.
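Regarding point (3), interval end positions in DiagnoseTargets-style VCF output typically live in the INFO field as END=&lt;pos&gt; (the standard VCF convention for interval records). A small post-processing sketch, assuming that convention, that flattens such records into a tab-friendly (chrom, start, end) table:

```python
def vcf_intervals_to_table(vcf_lines):
    """Extract (chrom, start, end) from VCF records carrying END= in INFO."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        end = pos  # fall back to the start position if no END key is present
        for kv in info.split(";"):
            if kv.startswith("END="):
                end = int(kv[4:])
        rows.append((chrom, pos, end))
    return rows
```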

Created 2013-03-04 15:38:03 | Updated | Tags: depthofcoverage indels

Hello,

For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.

Do you have a suggestion for how to solve this problem in an automated (i.e., non-IGV) fashion?

Additionally, as DepthOfCoverage is being retired, what do you recommend we use for getting SNP and deletion counts?

Thank you
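One automated (non-IGV) approach is to count indel evidence directly from read alignments: insertions appear as I operations and deletions as D operations in each read's CIGAR string. A minimal sketch (pure Python, operating on CIGAR strings you would extract from the BAM yourself, e.g. with samtools view):

```python
import re

# SAM-spec CIGAR operations: length followed by one of MIDNSHP=X
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def count_indel_reads(cigars):
    """Count reads whose CIGAR contains at least one insertion / deletion."""
    ins = sum(1 for c in cigars if any(op == "I" for _, op in CIGAR_OP.findall(c)))
    dels = sum(1 for c in cigars if any(op == "D" for _, op in CIGAR_OP.findall(c)))
    return ins, dels
```

Restricting the input to reads overlapping the event position of interest gives per-sample insertion and deletion counts even at sites where no call was made.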

Created 2012-10-22 17:31:20 | Updated 2012-10-22 18:22:28 | Tags: depthofcoverage refseq

Hi,

I am learning to use the DepthOfCoverage function to obtain the gene coverage information for a collection of bacterial contigs that were mapped with metagenomic reads. The original post introducing this function is here: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have#latest

In the post, you mentioned the gene list, as follow:

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,

I have two inquiries:

1. Can you please provide headers to the values in each column?
2. I am working with bacterial genomic contigs; can you please specify what basic information is needed for a gene list (e.g., name of contig, name of gene, location of gene in the contig, from ... to ..., etc.)?

Thanks so much!

Leo
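For reference, the sample lines above follow the UCSC refGene table schema. A sketch listing those column names (as documented by UCSC; treat this as an assumption for any other gene database) together with a trivial parser:

```python
# UCSC refGene table columns, in order (the format expected for -geneList)
REFGENE_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
    "score", "name2", "cdsStartStat", "cdsEndStat", "exonFrames",
]

def parse_refgene_line(line):
    """Map one tab-separated refGene line to a column-name -> value dict."""
    return dict(zip(REFGENE_COLUMNS, line.rstrip("\n").split("\t")))
```

For custom contigs, the fields that matter most are the transcript name (`name`), the contig (`chrom`), the strand, and the start/end coordinates; the cds/exon columns can mirror the transcript bounds for a simple single-exon gene model.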

Created 2012-10-15 14:43:35 | Updated 2012-10-18 00:43:47 | Tags: depthofcoverage bam

I am getting the following error when running DepthOfCoverage:

I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.

Created 2012-09-23 14:29:10 | Updated 2013-01-07 20:38:43 | Tags: depthofcoverage

When "RMDTrackBuilder - Loading Tribble index from disk for file refGene.sorted.txt" runs, this error occurs:

##### ERROR MESSAGE: Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary

Does this mean that my refGene.sorted.txt or b37.fasta file is badly formed? What should I do?
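The usual cause of this error is mixed reference naming conventions: hg19-style files use chr-prefixed contigs (chr1, chrM) while b37 uses bare names (1, MT). A hedged sketch that rewrites the contig column of a refGene-style line to b37 naming (assuming the contig is the third tab-separated column, as in the refGene format):

```python
def chr_to_b37(name):
    """Map an hg19-style contig name to its b37 equivalent."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name

def rewrite_line(line):
    """Rewrite the contig column (3rd tab-separated field) of one refGene line."""
    cols = line.rstrip("\n").split("\t")
    cols[2] = chr_to_b37(cols[2])
    return "\t".join(cols)
```

Note that hg19's chrM and b37's MT are different mitochondrial sequences, so mitochondrial records may still mismatch even after renaming.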

Created 2012-08-23 15:27:54 | Updated 2013-01-07 19:11:47 | Tags: depthofcoverage duplicatereadfilter

I was wondering if there is an option to remove duplicate reads when the coverage is determined using DepthOfCoverage from a BAM file, or if there is an alternate way to remove the duplicate reads.

Thanks Amin
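For readers with a similar question: duplicates are marked in the SAM FLAG field (bit 0x400), typically set beforehand by a tool such as Picard MarkDuplicates, and duplicate-read filters then exclude any read with that bit set. A minimal sketch of the flag check itself:

```python
def is_duplicate(flag):
    """True if SAM FLAG bit 0x400 (PCR or optical duplicate) is set."""
    return bool(flag & 0x400)
```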