# Tagged with #depthofcoverage 3 documentation articles | 1 announcement | 49 forum discussions

### Overview

This document covers the use of various tools and metrics associated with depth of coverage.

### Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determines the genotype calls.

TO BE CONTINUED... meanwhile, check out the DepthOfCoverage and DiagnoseTargets tool docs.

A new tool has been released!

Check out the documentation at DepthOfCoverage.

#### See this announcement regarding our plans for support of DepthOfCoverage and DiagnoseTargets. If you find that there are functionalities missing in either tool, leave us a comment and we will consider adding them.

For a complete, detailed argument reference, refer to the GATK document page here.

### Introduction

DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:

• Total, mean, median, and quartiles for each partition type: aggregate
• Total, mean, median, and quartiles for each partition type: for each interval
• A series of histograms of the number of bases covered to Y depth for each partition type (granular; e.g. Y can be a range, like 16 to 22)
• A matrix of counts of the number of intervals for which at least Y samples and/or read groups had a median coverage of at least X
• A matrix of counts of the number of bases that were covered to at least X depth, in at least Y groups (e.g. # of loci with ≥15x coverage for ≥12 samples)
• A matrix of proportions of the number of bases that were covered to at least X depth, in at least Y groups (e.g. proportion of loci with ≥18x coverage for ≥15 libraries)

Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.

For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.

DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.

### Coverage by Gene

To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument

-geneList /path/to/gene/list.txt


The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
587     NM_001005277    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F16  cmpl    cmpl    0,
587     NM_001005221    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F29  cmpl    cmpl    0,
589     NM_001005224    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F3   cmpl    cmpl    0,
589     NM_001005277    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F16  cmpl    cmpl    0,
589     NM_001005221    chr1    -       610958  611897  610958  611897  1       610958, 611897, 0       OR4F29  cmpl    cmpl    0,


If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at

/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt


If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:

Gene_Name     Total_Cvg       Avg_Cvg       Sample_1_Total_Cvg    Sample_1_Avg_Cvg    Sample_1_Cvg_Q3       Sample_1_Cvg_Median      Sample_1_Cvg_Q1
SORT1    594710  238.27  594710  238.27  165     245     330
NOTCH2  3011542 357.84  3011542 357.84  222     399     &gt;500
LMNA    563183  186.73  563183  186.73  116     187     262
NOS1AP  513031  203.50  513031  203.50  91      191     290


Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.

We have decided to continue providing and supporting DepthOfCoverage and DiagnoseTargets for the foreseeable future. Going forward, we'll try to integrate them and develop their features to address the main needs of the community. To this end we welcome your continuing feedback, so please feel free to contribute comments and ideas in this thread.

To all who took the time to tell us what you find useful about DoC and DT (and what you wish it could do), a big thank you! This is always very useful to us because it helps us identify which features are most valuable to our users.

When generating the coverage metrics, how does DepthofCoverage treat half-mapped reads (reads where only 1 arm of the mate pair mapped): are they treated like unmapped reads or counted towards coverage for the mapped arm?

Hello GATK team,

I'm running the Best Practice pipeline, and I would like to do some coverage calculation on my samples. I want to use DepthOfCoverage, but I can't decide which bam file to use - the original one, the one after BQSR stage (right before HC) or the bamout of HC? The point is that the first two files don't contain the reads as they were mapped in the variant calling, since the HC changes the mapping dramatically. The bamout contains haplotypes, and therefor reads are merged together, and the coverage might look lower than the one that uses the reads.

What do you recommend? Maya

Hello, I'm trying to understand why gte_25 != %_bases_above_25

sample_summary->%_bases_above_25 == 88.6%

sample_cumulative_coverage_proportions->gte_25 == 0.91

I would expect both to be identical but they never are.

They are always close though.

There is something in the calculation that eludes me.

Thanks Louis

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has the -stand_call_conf and -stand_emit_conf that both of which use a "minimum phred-scaled confidence threshold".

What is the difference between the base quality flag and the phred-scaled confidence thresholds flags in deciding whether or not to call a variant?

Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one-and-the-same.

Is there any way to specify a minimum minor-allele-frequency for calling variants in Haplotyper? Under the default settings, the program expects 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.

Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

Hi GATK team,

I think I got some problem when I tried to analysis the GATK DOC results to find the rate of intervals with coverage >=100X in my targeted sequence data.

When I check the XXX.sample_interval_statistics file, just use awk -F"\t" '{print $102}' XXX.sample_interval_statistics, it gave me the following result: depth>=100 372 but when I checked the interval_summary file, I tried to do tail -n +2 XXX.sample_interval_summary | awk -F"\t" '$3>=100' | wc -l, it gave me

387

different from the previous 372. It means the third column in XXX.sample_interval_summary doesn't present the common "X coverage"? Does I misunderstand something here?

Thank you very much.

bless~ XL

Hi,

When I check the coordinates of intervals given by DepthOfCoverage "*.interval_summary" file, I found that it tried to add one on the start position of each intervals. For example, the original coordinates in my .bed interval file is 3669022-3669474, then the DOC will give 3669023-3669474, only the start position, not the end one, actually.

I am not sure whether this is caused by "start from 0" or "start from 1" problem, because I have no idea about the coordinates in my original bed files start from 1 or start from 0.

I just worry does GATK DOC will miss one base for each interval?

bless~ XL

I'm getting the following error in GATK 3.2-2:

MESSAGE: SAM/BAM file /home/dwragg/work/Analysis/TEST/AOC35_ATCACG_L008/AOC35_ATCACG_L008_bootstrap.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!

When attempting to calclate depth of coverage:

java -d64 -jar ${GATK}/GenomeAnalysisTK.jar \ -T DepthOfCoverage \ -R${REF} \ -I ${OUT}/${SAMPLE}/${SAMPLE}_bootstrap.bam \ -o${OUT}/${SAMPLE}/metrics/${SAMPLE}_GATKcov \ -ct 2 -ct 5 -ct 8 \ --omitDepthOutputAtEachBase \ --omitIntervalStatistics \ --omitLocusTable \ -l FATAL

I'm assuming this is not normal as everything was working fine in 3.1-1. Ironically I recently switched to 3.2-2 because the RealignerTargetCreator was giving me an Unsupported major.minor version 51.0 error. I'm running Java 1.7.0-b147.

[EDIT}

I've since noticed that the BAM file was created with GATK version 2.4-9-g532efad. So the 'current' and 'latest' software filing system in place here appears to have failed. I can confirm that GATK v3.2-2 DepthOfCoverage tool generates the above program record group id error. I'll chase the administrators down to ensure the latest GATK version is installed and start from scratch.

Hi,

I have used the following commands using DepthOfCoverage tool with two different bed files:

  java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ucsc_hg19.fa -I WT_recalibrated.bam -L coverage_summary.bed -ct 1 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100 -o WT_cov


The line count for the input and output:

$wc -l WT_cov.sample_interval_summary 4988 WT_cov.sample_interval_summary$ wc -l coverage_summary.bed
10585 coverage_summary.bed


In the other case:

  java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ucsc_hg19.fa -I WT_recalibrated.bam -L exon.bed -ct 1 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100 -o WT_exon


Line count for the input and output:

 $wc -l WT_exon.sample_interval_summary 5065 WT_exon.sample_interval_summary$ wc -l exon.bed
5065 exon.bed


The input in both the cases is of the standard format as shown below:

 chr1    6529578 6529755
chr1    6530273 6530442
chr1    6530543 6530721
chr1    6530773 6530980
chr1    6531028 6531730
chr1    6531768 6531914
chr1    6532563 6532713
chr1    6533023 6533273


Could anyone help to interpret the discrepancy between number of target regions in bed file and _interval_summary file in the above two cases?

Hi all,

I am using depthofCoverage tool as shown below:

  java -Xmx2g -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R  ucsc_hg19.fa -geneList:REFSEQ refGene.bed -L exons.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50 -I WT_recalibrated.bam -o cov


Original exons.bed file looks like:

 chr1    6526151 6527632 PLEKHG5_1       0       +
chr1    6527884 6528646 PLEKHG5_2       0       +
chr1    6529101 6529301 PLEKHG5_3       0       +
chr1    6529394 6529510 PLEKHG5_4       0       +
chr1    6529603 6529736 PLEKHG5_5       0       +
chr1    6530295 6530415 PLEKHG5_6       0       +
chr1    6530565 6530703 PLEKHG5_7       0       +
chr1    6530794 6530944 PLEKHG5_8       0       +
chr1    6531049 6531697 PLEKHG5_9       0       +


Then, refGene.bed file is prepared based on the awk solution provided at the thread 'http://gatkforums.broadinstitute.org/discussion/3229/exon-output-depth-of-coverage'

refGene.bed

1       PLEKHG5_exon_1       chr1    +       6526151 6527632 6526151 6527632 1       6526151 6527632 0       PLEKHG5 exon_1  cmpl    cmpl    0
2       PLEKHG5_exon_2       chr1    +       6527884 6528646 6527884 6528646 1       6527884 6528646 0       PLEKHG5 exon_2  cmpl    cmpl    0
3       PLEKHG5_exon_3       chr1    +       6529101 6529301 6529101 6529301 1       6529101 6529301 0       PLEKHG5 exon_3  cmpl    cmpl    0
4       PLEKHG5_exon_4       chr1    +       6529394 6529510 6529394 6529510 1       6529394 6529510 0       PLEKHG5 exon_4  cmpl    cmpl    0
5       PLEKHG5_exon_5       chr1    +       6529603 6529736 6529603 6529736 1       6529603 6529736 0       PLEKHG5 exon_5  cmpl    cmpl    0


However, the above command returns the error:

 ##### ERROR MESSAGE: Error parsing line at byte position: htsjdk.tribble.readers.LineIteratorImpl@5916c00a, for input source:refGene.bed


Could someone help to fix this. This is not basically a refGene file. This is prepared from a custom exon.bed file which doesn't have refGene notations using the awk solution in the above thread.

Greetings,

I am trying to run a depth of coverage calculation as part of the XHMM workflow for cnv analysis. Upon running the following bash code, I am returned with this error.

java -Xmx3072m -jar ./Sting/dist/GenomeAnalysisTK.jar \ -T DepthOfCoverage -I group1.READS.bam.list -L EXOME.interval_list \ -R ./human_g1k_v37.fasta \ -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable \ --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 \ --includeRefNSites \ --countType COUNT_FRAGMENTS \ -o group1.DATA

##### ERROR MESSAGE: Argument with name 'includeRefNSites' isn't defined.

Thanks!

Steven

I've been learning how to use the DepthOfCoverage to calculate coverage across genes. I noticed that some genes were covered by the bam, in the interval_list file, and in the geneList file, but were not reported in the gene_summary table. Most of these were non-coding RNAs, and in reviewing the geneList file, I noticed that for the non-coding RNAs, the Coding region start position is 1 base higher than the Coding region end position (which also put the Coding region start position higher than the Transcription end position). I adjusted the file and made the Coding region identical to the Transcription regions for the non-coding RNAs, and this resolved the issue for most of the genes. It appears that remaining genes that are still not reported in the gene_summary table, all overlap with exons or UTRs from other genes. My questions are:

1. Is it the case that if two exons overlap, only one will be reported on, or is something else going on?
2. What regions of the gene is the tool reporting on, the whole transcribed region, or just the coding region?
3. Am I safe in changing the coding regions of non-coding RNAs to equal the transcribed region for the coverage analysis?

Thank you.

I ran DepthOfCoverage on several samples for ~5000 genes. It turns out that a handful (~30) of these genes are overlapping (but on different strands). In these cases, it seems that DoC analyzes the first overlapping gene but not the second overlapping gene. For example, in this case (below), it will produce depth stats for the first interval, but not the second interval. You can see the end of the first interval 495620 is actually bigger than the beginning of the second interval 494949.

Pv_Sal1_chr14:493199-495620  ## On (-) strand
Pv_Sal1_chr14:494949-495542  ## On (+) strand


Is there a way to force DoC to analyze overlapping intervals? I'm using GATK v3.2.2 with the following command:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I bamnames.list -o coverage/test.cov -geneList coverage/test.refseq -L coverage/test.intervals -omitBaseOutput --minMappingQuality 20 --minBaseQuality 20


Thanks for any help!

I am trying to compute mean coverage (using GATK DepthOfCovearge) for a BAM file (targeting sequencing) aligned using reference hg19.

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ucsc.hg19.fasta \
-T DepthOfCoverage \
-I my_bam.list \
-L my_targets.bed \
-o coverage


The problem reported is:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
##### ERROR ------------------------------------------------------------------------------------------


Hi Geraldine/GATKers :-)

I was trying to access the documentation for DepthOfCoverage from this page:

But the link doesn't seem to be working. Could you possibly update the link so that I could take a look at the argument details?

Thank you so much as always! Steve

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off.

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour? If so why does it occur? If not any idea what is going on here, is it a bug in the variant caller or the depth statistics? Do you see the same thing in other datasets?

Thanks!

Dear all,

I have a large number of gVCF files, either single samples or combined in approximately 100 samples per combined gVCF file. I would like to compute something like the average depth for a set of regions of interest from the combined or single gVCF files. I can see that I could try to get a VCF output at every position, and use that to infer the percentage with a given read depth, but that seems mightily cumbersome.

So my question: is there an equivalent to the DepthOfCoverage GATK module that takes as input gVCF files? ideally combined gVCF and extract per sample average/median/minimum depth but otherwise I can work with single sample gVCF data.

I have a feature request:

I'm a little surprised to find that the DepthOfCoverage tool still does not allow parallelization as of v3.1-1. Would it be possible to parallelize this tool and others that consider samples/BAMs data independent. Pretty please... :-)

Thanks!

##### ERROR ------------------------------------------------------------------------------------------

Dear all,

I am interested in creating genome-wide gVCFs from various sources, including exome samples. The rationale is that it is not completely clear what the target region actually is all the time, and it would be good to keep some flexibility about what I capture.

Nevertheless, these single sample gVCF files can become quite large, but most of the lines in the file are used to store very low depth data ( DP <= 2) which is probably not critical for downstream analysis. I would be interested to have, in the HaplotypeCaller gVCF data generation process, an additional parameter that sets as missing regions with, say, 2 reads or less. That would considerably reduce the storage at little cost in terms of discarded information.

Does such a parameter exist already? Would it make sense at all to add this? I am not sure but right now it seems to me it might be useful.

Vincent

I am using DepthOfCoverage for the first time

I used the following command to get the coverage and I am getting only one output file, which gives coverage per samples. I am giving gene list in REFSEQ format along with a sorted interval BED file. However I am not getting any file with per gene coverage.

Could anyone clarify whether I am missing something in the arguments

java -Xmx8g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T DepthOfCoverage --calculateCoverageOverGenes:REFSEQ Genes_refgene.txt --outputFormat table -o Coverage_summary -I BAM.list -L genes.bed

Thanks, Tinu

Hi,

I tried to use DepthOfCoverage to figure out the coverage of 50 whole exome sequencing data. It works fine for single sample, however, the program always complained about memory issue, even I provide 100 GB as "-Xmx100g". Any suggestions for the problem? There is a option "--read_buffer_size", which seems to be helpful, how should set the value ?

Parallelism options

This tool can be run in multi-threaded mode using this option.

TreeReducible (-nt)

Yet when I run the GATK with -nt, I get an error that -nt is not supported (truncated to save space):

ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a): ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option.

I have used the tool DepthOfCoverage to stat the depth of my BAM file. my script is as below:

java -Xmx4g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T DepthOfCoverage -I IonXpress010.bam -o my.depth


But I'm confuse about the result in the *depth file. for example,

> my.depth\n
Locus   Total_Depth     Average_Depth_sample    Depth_for_none\n
**13:32890500**    0    0.00 0\n
13:32890501     30    30.00 30\n


but in my *.bam file, I tried the order below , while the result is 7, which is not same as 0 in my.depth.

samtools view *.bam | awk '$3==13 &&$4<=32890500 && $4+length($10)>=32890500'  - |wc \n
7  Sorry for my poor English. can anybody meet this problem ? and tell me how the tools works? Thanks a lot.. I am using this command to calculate the depth of coverage with three different formats for the EXOME_interval.list : java -Xmx2048m -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I /home/vsinha/software/bam/group1.READS.bam.list -L /home/vsinha/software/EXOME.interval_list -R /home/vsinha/software/human_g1k_v37.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /home/vsinha/software/group1.DATA 1st format : 1 69114 69332 1 69383 69577 1 69644 69940 1 69951 70028 1 566170 566275 1 801895 801983 1 802332 802416 1 861272 861420 Error : ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file? 2nd format: 1:69114-69332 1:69383-69577 1:69644-69940 1:69951-70028 1:566170-566275 1:801895-801983 1:802332-802416 1:861272-861420 Error: Badly formed genome loc: Contig '1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file? 3rd Format: chr1:69114-69332 chr1:69383-69577 chr1:69644-69940 chr1:69951-70028 chr1:566170-566275 chr1:801895-801983 chr1:802332-802416 chr1:861272-861420 Error: Contig chr1 not present in sequence dictionary for merged BAM header: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1] Can you please let me know what is the problem in my file. Thank You in advance. Regards Vishal Are secondary alignments, and duplicates taken into consideration by DepthOfCoverage? How can you skip secondary alignments for coverage estimation? Hi, im trying to calculate the coverage of a exome sequencing project. My Code is java -jargatk_jar \ -T DepthOfCoverage \ -R $ref_genome \ -I "$sample_name".recal_reads.bam \ -o $result_dir/$sample_name/coverage/"$sample_name" \ -geneList$ref_refseq \ -L test.bed \ -ct 10 -ct 20 -ct 30 but i get an ERROR that i cant solve...

##### ERROR MESSAGE: Unknown file is malformed:

Could not parse location from line: chr1 66999814 67000061 NM_032291_exon_0_10_chr1_66999825_f 0 +

This error relates to my refseq_file but i cant find the problem with it. This is the first line from my geneList so something is messed up.

I created it along http://www.broadinstitute.org/gatk/guide/article?id=1329 and used the script sortByRef.pl to sort it by the fai file from your bundle

Can you give me any advise to check on ?

Thank you very much for helping!

I used following command to get coverage data for entire genome: java -Xmx2g -jar ~/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/Gmax_189.fa -o DoCov_dedup -I dedup.bam.list

Coverage was not reported for 36025109 bases. Many skipped bases fall one after another in the genome to form big continuous regions. I do not understand, why these regions were skipped. Usually if there was no mapping, the coverage value should be reported as 0.

Appreciate some insight into this issue! Thank you.

Hello,

I'm trying to calculate depth of coverage for entire contigs for multiple samples. I have ran the following command: java -Xmx2g -jar GenomeAnalysisTK.jar \ -R Ecoli/Ecoli.allSubTypes.fasta \ -T DepthOfCoverage \ -o ../Ecoli/all.tmpOut \ -I Ecoli/bamlist.list \ -geneList Ecoli/Ecoli.refSeq

Where I've tried to generate a refSeq file with one line per contig.

I was expecting to have the output be in the form of a matrix with the various contigs as rows and the samples as columns.

Instead I got this looking file: Locus Total_Depth Average_Depth_sample Depth_for_sample1 gi|312944605|gb|CP001855.1|:1 0 0.00 0 gi|312944605|gb|CP001855.1|:2 0 0.00 0 gi|312944605|gb|CP001855.1|:3 0 0.00 0 gi|312944605|gb|CP001855.1|:4 0 0.00 0 gi|312944605|gb|CP001855.1|:5 0 0.00 0 gi|312944605|gb|CP001855.1|:6 0 0.00 0 gi|312944605|gb|CP001855.1|:7 0 0.00 0 gi|312944605|gb|CP001855.1|:8 0 0.00 0 gi|312944605|gb|CP001855.1|:9 0 0.00 0

were each base is a row. right?

What am I doing wrong?

Thanks! Moran.

Hi there! I tried to analyse the depth of coverage of my exome data using GATK DepthOfCoverage java -Xmx${heap}m \ -Djava.io.tmpdir=$processed_bam/tmp_folder_dept \ -jar $gatk \ -T DepthOfCoverage \ -omitBaseOutput \ -omitLocusTable \ -R$GRCh37_ref_WholeGenome \ -I $file \ -o$coverage/${samplename}.coverage.dept I am not interested in particular genomic regions, so I haven't a target list file. How were the intervals determined in "_interval_statistics" output file? Where can I obtain an interval target list file including all exons in the human genome? Thank you very much! Fulvio Hello, I'm currently using DepthOfCoverage with -L option and a BED intervals file. I would to get gene report also. I'm not familiar with RefSeq format for the gene list. I work with HGNC and Ensembl genes. I could get a gene list file with the following format: HGNC chr start end But this doesn't seem to work. Any suggestions on how I could achieve that or how I could generate a valid gene list file with my current intervals? Thank you. Does the walker DepthOfCoverage support multi-threading? According to the documentation: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html "This tool can be run in multi-threaded mode using this option. TreeReducible (-nt)" When I try with GATK2.5 I get this error message: "Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option." Would it work, if I upgraded to 2.7? You upgrade GATK more frequently than I change my under wear, so it is difficult to keep up :) Thanks! Can someone point me to documentation on how I can run DepthOfCoverage without the NotPrimaryAlignmentFilter? Hello, I'm runng DOC on my exome data and the output is hard to connect. This is the command i'm using: java -Xmx2g -jar ~/programs/GenomeAnalysisTK-2.2-2-gf44cc4e/GenomeAnalysisTK.jar -T DepthOfCoverage -R ~/dependencies/reference/hg19.fasta -o doc -I doctest.bam -geneList:REFSEQ ../refSeqGenes.txt -L ~/dependencies/bedfiles/coverage/500genes.exons.refGene.sort.bed -ct 1 -ct 5 -ct 10 -ct 20 -ct 50 The refseqfile looks like this: 75 NM_003036 chr1 + 2160133 2241652 2160205 2238204 7 2160133,2234416,2234723,2235278,2235731,2237458,2238015, 2161174,2234542,2234839,2235541,2236024,2237689,2241652, 0SKI cmpl cmpl 0,0,0,2,1,0,0, 637 NM_001195563 chr1 + 6845383 6932107 6845590 6931825 4 6845383,6880240,6885151,6931816, 6845635,6880310,6885270,6932107, 0 CAMTA1 cmpl cmpl 0,0,1,0, 676 NM_000302 chr1 + 11994723 12035599 11994836 12034865 19 11994723,12008032,12009829,12010413,12012679,12014886,12016973,12017898,12018572,12020702,12023588,12024231,12024700,12025536,12026307,12027043,12030726,12032928,12034709, 11994912,12008124,12009963,12010577,12012792,12014950,12017071,12018000,12018704,12020824,12023693,12024357,12024842,12025650,12026373,12027148,12030873,12033054,12035599, 0 PLOD1 cmpl cmpl 0,1,0,2,1,0,1,0,0,0,2,2,2,0,0,0,0,0,0, The exon bed file looks like this: chr1 2160205 2161174 NM_003036_cds_0_0_chr1_2160206_f 0 + chr1 2234416 2234542 NM_003036_cds_1_0_chr1_2234417_f 0 + chr1 2234723 2234839 NM_003036_cds_2_0_chr1_2234724_f 0 + chr1 2235278 2235541 NM_003036_cds_3_0_chr1_2235279_f 0 + chr1 2235731 2236024 NM_003036_cds_4_0_chr1_2235732_f 0 + chr1 2237458 2237689 NM_003036_cds_5_0_chr1_2237459_f 0 + chr1 2238015 2238204 NM_003036_cds_6_0_chr1_2238016_f 0 + chr1 6845590 6845635 NM_001195563_cds_0_0_chr1_6845591_f 0 + chr1 6845590 6845635 NM_015215_cds_0_0_chr1_6845591_f 0 + chr1 6880240 6880310 NM_001195563_cds_1_0_chr1_6880241_f 0 + chr1 6880240 6880310 NM_015215_cds_1_0_chr1_6880241_f 0 + chr1 6885151 6885270 NM_001195563_cds_2_0_chr1_6885152_f 0 + chr1 6885151 6885270 NM_015215_cds_2_0_chr1_6885152_f 0 + chr1 6931816 6931825 NM_001195563_cds_3_0_chr1_6931817_f 0 + chr1 7151363 7151431 NM_015215_cds_3_0_chr1_7151364_f 0 + The gene_summary file shows coverage on all genes: SKI 73147 33.45 73147 33.45 21 32 45 100.0 99.8 93.4 75.8 16.6 CAMTA1 8158 33.57 8158 33.57 25 33 47 100.0 100.0 100.0 81.5 17.3 UNKNOWN 5312955 54.97 5312955 54.97 32 49 74 99.7 98.2 96.2 87.9 48.3 PLOD1 85711 39.24 85711 39.24 26 34 48 100.0 99.5 97.3 84.8 22.8 HSPG2 407112 30.90 407112 30.90 19 29 41 99.5 98.0 91.8 72.3 13.6 and the interval_summery file shows all exons: chr1:2160206-2161174 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4 chr1:2234417-2234542 7165 56.87 7165 56.87 41 60 71 100.0 100.0 100.0 100.0 62.7 chr1:2234724-2234839 4946 42.64 4946 42.64 42 43 45 100.0 100.0 100.0 100.0 0.0 chr1:2235279-2235541 9542 36.28 9542 36.28 30 38 46 100.0 100.0 100.0 100.0 10.3 chr1:2235732-2236024 7763 26.49 7763 26.49 24 28 32 100.0 100.0 100.0 90.8 0.0 chr1:2237459-2237689 9711 42.04 9711 42.04 34 49 53 100.0 100.0 100.0 88.3 42.9 chr1:2238016-2238204 2081 11.01 2081 11.01 9 13 15 100.0 97.9 57.7 0.0 0.0 chr1:6845591-6845635 655 14.56 655 14.56 14 16 17 100.0 100.0 100.0 0.0 0.0 This information is used to verify that a certain variant was covered good, not only on the position of the variant but also over the rest of the gene it is in. This is a bit hard to see in the interval_summery file because I can't see the exon or genenames in there and I can't see this in the gene_summery file because i mis individual exon information. This means crossrefferencing the bedfile and interval_summery with ugly scripting or creating a spoofed refseq file per exon. Is there an option to format the output where the coverage of each exon is shown with the exon and/or gene name (from the info in the bed or refseq file)? e.g. something like this: chr1:2160206-2161174 NM_003036_cds_0_0_chr1_2160206_f SKI 31939 32.96 31939 32.96 18 32 44 100.0 100.0 93.3 70.4 16.4 Regards, Martin Hello! Trying to downsample in an orderly fashion in the name of experimentation, and in doing so would like to specify just one chromosome for the experiment - so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up: ##### ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases. I was hoping to poke through results from using the HaplotypeCaller with many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate, though I was hoping to find out what level of coverage led to what level of results and using -dfrac feels much less specific as it appears to toss a fraction of however many reads where at a given position, rather then tossing reads over a certain coverage. Thus with -dfrac, I could say that my sample had an average of 30x for this chromosome and I tossed half so theoretically I've simulated 15x depth of coverage... Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction? Thanks for any help/discussion! -Tristan I'm a bit confused about some of the output of GATK's DepthOfCoverage script. In the *.GATK.covdepth.sample_summary, what does the "total" column signify? Is it the total number of base pairs in the data? Because my raw read data has ~37.2 billion bp. However, GATK reports over 92 billion bp. Does the 'total' mean something else, or will I have to rerun the DepthofCoverage script? Thanks in advance! Hi team, This is two separate questions: 1. Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ? 2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only. Many thanks and apologies if I've missed anything in the instructions. Cheers, Blue In the output grp file, #:GATKReport.v1.1:1 #:GATKTable:3:880:%s:%s:%s:; #:GATKTable:BaseCoverageDistribution:A simplified GATK table report Coverage Count Filtered 0 2859049 2932784 1 856997 837791 2 288587 276253 3 95618 91703  what's the meaning of the three columns? Thanks, Hi I am trying to run the DepthofCoverage walker from GATK for the first time and I am running into stumbling blocks. I am interested in getting coverage info for genes and I downloaded the Refseq genes list and am filtering and sorting according to specifications. I am using the SortByRef.pl script provided in the website. However, I am getting an error saying: Can not open temporary file 1045: Too many open files at SortByRef.pl line 68, <$INPUT> line 1021.


I also checked the soft/hard limits for filesystem and it shows that my current system's limit is the following:

ulimit -u :386246
ulimit -S -u: 386246


I am stuck at this point and don't know how to proceed.

Any help is much appreciated.

Thanks

Hello,

i have spend many hours trying to run GATK depthofcoverage but it never works. last try: -T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50

My BAM header seem to be malformed.

ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

here is the 1rst line of the header:

@SQ SN:Chloroplastgenomebarley LN:136462 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

I Have search in the forum and doc about it. I have try to reorder my header with picard:

@HD VN:1.4 SO:unsorted @SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93 @PG ID:bwa PN:bwa VN:0.5.9-r16 ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4 ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB?? but no more change.

Someone can help me please ?

Regards, Remi



Hi there,

this is my interval_list

chr1 762095 762275 LINC00115|NR_024321 chr1 762280 762414 LINC00115|NR_024321 chr1 762420 762565 LINC00115|NR_024321 chr1 777259 777349 LOC643837 chr1 777391 777481 LOC643837 chr1 777482 777642 LOC643837 chr1 783061 783151 LOC643837 chr1 792270 792446 LOC643837 chr1 861266 861496 NM_152486|SAMD11 chr1 865582 865787 NM_152486|SAMD11 chr1 866331 866507 NM_152486|SAMD11

and this is the output from the sample_interval_summary

chr1:762095-762275 ... chr1:762280-762414 ... chr1:762420-762565 ... chr1:777259-777349 ... chr1:783061-783151 ... chr1:792270-792446 ... chr1:861266-861496 ... chr1:865582-865787 ... chr1:866331-866507 ...

why am I missing two exons?

this is my cmd:

java -Xmx32g -jar /local/apps/gatk/2.5-2-gf57256b/GenomeAnalysisTK.jar -I sample.bam -R .../genome.fa -T DepthOfCoverage -o jtn -geneList hg19.tsv -L exons.list --omitDepthOutputAtEachBase --includeDeletions --interval_merging OVERLAPPING_ONLY -l INFO

/M

I am running GATK DepthOfCoverage one a bunch of samples (sequenced using Illumina MiSeq). For one of the samples, I am getting the following error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{<file.bam>} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 61; please see the GATK --help documentation for options related to this error

I found a suggestion that I should be using fixMisencodedQuals flag in this case. However, if I do that, "the engine will simply subtract 31 from every quality score as it is read in". That will fix this error, but then all my quality scores will be incorrect.

Furthermore, the BAM file I am using was last recalibrated with GATK. Why is GATK producing files with invalid quality scores?

What should I do?

Any help would be much appreciated.

Hello, I´m trying to use deptofcoverage to check the coverage of my reads. I already have the indexed bam files (created with sam to bam) and also reordered (reorder SAM/BAM) but they are still not recognized by depthofcoverage and I got this error message:

"Sequences are not currently available for the specified build"

I used "human (homo sapiens) hg19 full" for mapping but I can´t select it, it only allows b37 version.

Any suggestions?

Thank you very much in advance

Gema

Dear team,

I would like to include a RefSeq ROD file in order to get the coverage per gene using the GATK coverage tool (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). However, it is not really clear to me how I can easily generate such a file, since I can not find the right documentation. All the links that should point to this information seem to be (incorrectly?) redirected to the main GATK homepage). For example this http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html has a link that points to http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq which is redirected to http://www.broadinstitute.org/gatk/. Can someone point me to the correct docs? Other resources I found also point to the same wiki, which I can't find at the moment..

Kind regards, JJ

Just in the process of updating our pipeline from v2.3-4-gb8f1308 Lite to v2.4-7-g5e89f01 Academic and have run into a small issue. The command line:

-T UnifiedGenotyper -glm SNP -R /lustre/scratch111/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -I /lustre/scratch111/projects/helic/vcf-newpipe/lists/chr1-pooled.list --alleles /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.snps.vcf.gz -L 1:1-1000000 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A DepthOfCoverage -o /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.asnps.vcf.gz.part.vcf.gz


This worked in 2.3.4. But now gives:

Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly


I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac)

Now that DepthOfCoverage is being retired in GATK 2.4, I decided to investigate DiagnoseTargets. I have some questions.

1) Are there plans to support output formats other than VCF? What was great about DOC is I could easily send the output to anyone and it could be easily read. With DT, that requires additional processing.

2) DOC provided summary for intervals as well as samples. DT only does intervals. Is there a way to get per-sample info?

3) DOC output full intervals. DT only outputs the start positions. To get the end positions, an additional step is required. Can that be adjusted?

4) DOC provided coverage info for all intervals. DT only shows covered intervals, so if an interval is not covered, it will not be listed in the output. Is there a way to output all intervals?

Sorry if I sound too critical. I am a big fan of DepthOfCoverage and am disappointed to see it go.

Thank you.

Hello,

For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.

Does you have a suggestion for how to solve this problem in an automated (ie non-IGV fashion)?

Additionally, as DepthofCoverage is being retired, what do you recommend that we use for getting SNP and deletion counts?

Thank you

Hi,

I am learning to use the DepthofCoverage function to obtain the gene coverage information for a collection of bacterial contigs that were mapped with metagenomic reads. The original post introducing this function is here: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have#latest

In the post, you mentioned the gene list, as follow:

-geneList /path/to/gene/list.txt

The provided gene list must be of the following format:

585     NM_001005484    chr1    +       58953   59871   58953   59871   1       58953,  59871,  0       OR4F5   cmpl    cmpl    0,
587     NM_001005224    chr1    +       357521  358460  357521  358460  1       357521, 358460, 0       OR4F3   cmpl    cmpl    0,
`

I have three inquiries:

1. Can you please provide headers to the values in each column?
2. I am working with bacterial genomic contigs, can you please specify what basic information is needed for a gene list (e.g., name of contig, name of gene, location of gene in the contig, from... to ..., etc.)?

Thanks so much!

Leo

I am getting the following error when running DepthOfCoverage:

I have already reheadered my bam file to fix a contig mismatch error, and the fasta dict file was generated automatically by gatk. Moreover, about 160 lines of output are generated, but I do not see any irregularities at the position where the code crashed. Please let me know what I can try. Thank you.

when " RMDTrackBuilder - Loading Tribble index from disk for file refGene.sorted.txt " error occurs:

##### ERROR MESSAGE: Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary

is it means that my refGene.sorted.txt or b37.fasta file is badly formed? what should I do ?

I was wondering if there is an option to remove duplicate reads when the coverage is determined using DepthOfCoverage from a .BAM file. Or is there an alternate way to remove the duplicate reads.

Thanks Amin