Tagged with #downsampling
1 documentation article | 0 announcements | 13 forum discussions



Created 2012-08-11 05:16:06 | Updated 2015-12-19 10:53:18 | Tags: developer downsampling

Comments (1)

Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.

See below for details of how this is implemented and controlled in GATK.


1. Downsampling to a target coverage

The principle of this downsampling type is to downsample reads to a given capping threshold coverage. Its purpose is to get rid of excessive coverage, because above a certain depth, having additional data is not informative and imposes unreasonable computational costs.

The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect.

Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: the downsampling algorithm will under some circumstances retain slightly more or less coverage than requested.
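For read-based traversals, the effect can be pictured as a simple cap on the number of reads that share an alignment start. Below is a minimal, self-contained sketch of that idea (it is not the GATK's actual implementation; the class and method names are purely illustrative):

```java
import java.util.*;

// Illustrative sketch only (not GATK code): for read-based traversals, -dcov
// acts as a cap on the number of reads sharing the same alignment start.
public class PerStartCapSketch {

    // Reads are modelled here only by their alignment start coordinate.
    public static List<Integer> capPerAlignmentStart(List<Integer> alignmentStarts,
                                                     int maxPerStart,
                                                     Random rng) {
        // Group reads by their alignment start.
        Map<Integer, List<Integer>> byStart = new TreeMap<>();
        for (Integer start : alignmentStarts) {
            byStart.computeIfAbsent(start, k -> new ArrayList<>()).add(start);
        }

        // Randomly keep at most maxPerStart reads from each group.
        List<Integer> kept = new ArrayList<>();
        for (List<Integer> group : byStart.values()) {
            Collections.shuffle(group, rng);
            kept.addAll(group.subList(0, Math.min(maxPerStart, group.size())));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> reads = Arrays.asList(100, 100, 100, 100, 105, 105, 110);
        // With a cap of 2, at most two reads remain per alignment start.
        System.out.println(capPerAlignmentStart(reads, 2, new Random(42)));
    }
}
```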

Defaults

The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.
  • To change the default per-sample coverage, pass the desired value to the -dcov option.

To modify the walker's default behavior:

  • Add the @Downsample annotation at the top of your walker class. Override the downsampling type by changing by=<value>, and override the downsampling depth by changing toCoverage=<value>.
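For example, a walker's defaults might be declared roughly as follows. This is a sketch based on the GATK 3.x walker API described above; the imports are omitted, and the exact package paths and enum name are assumptions that may differ between versions:

```java
// Sketch of a walker declaring its own default downsampling behaviour via the
// @Downsample annotation described above. Imports are omitted; exact package
// paths and type parameters may differ between GATK versions.
@Downsample(by = DownsampleType.BY_SAMPLE, toCoverage = 250)
public class MyLocusWalker extends LocusWalker<Integer, Long> {
    // map() and reduce() implementations go here as usual.
}
```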

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already-established pileup at a single-base locus, and a pile of incoming reads whose alignment start is at that locus + 1, the outline of the algorithm is as follows (a self-contained sketch in Java follows the outline):

For each sample:

  • Select reads with the next alignment start.
  • While the number of existing reads plus the number of incoming reads is greater than the target sample size: walk backward through each set of reads sharing the same alignment start and, if more than one read shares that alignment start, throw out one randomly selected read.
  • If we have n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.
  • Otherwise, we have zero slots available: choose the read from the existing pileup with the smallest alignment start, throw it out, and add one randomly selected read from the incoming pile.
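Below is a minimal, self-contained sketch of the idea outlined above. It is not the GATK's actual downsampler class; reads are modelled only by their alignment starts, a single sample is handled at a time, and the names are purely illustrative:

```java
import java.util.*;

// Illustrative sketch of the "downsample to coverage" leveling idea described
// in the outline above. Not the GATK's actual downsampler.
public class LevelingSketch {

    private final int targetSampleSize;                      // e.g. the -dcov value
    private final Random rng = new Random(42);
    private final List<Integer> pileup = new ArrayList<>();  // kept reads (alignment starts)

    public LevelingSketch(int targetSampleSize) {
        this.targetSampleSize = targetSampleSize;
    }

    // Offer the reads that share the next alignment start.
    public void addIncoming(List<Integer> incoming) {
        // While existing + incoming exceeds the target, walk backward through
        // alignment-start groups and drop one read from any group that still
        // has more than one read.
        while (pileup.size() + incoming.size() > targetSampleSize) {
            if (!removeFromSomeDuplicatedStart()) break;      // nothing left to thin out
        }

        int slots = targetSampleSize - pileup.size();
        if (slots >= 1) {
            // Randomly select up to 'slots' of the incoming reads.
            Collections.shuffle(incoming, rng);
            pileup.addAll(incoming.subList(0, Math.min(slots, incoming.size())));
        } else if (!incoming.isEmpty()) {
            // Zero slots: evict the read with the smallest alignment start and
            // admit one randomly selected incoming read.
            pileup.remove(Collections.min(pileup));
            pileup.add(incoming.get(rng.nextInt(incoming.size())));
        }
    }

    // Walk backward (highest alignment start first) and remove one read from
    // the first group containing more than one read. Returns false if no such group.
    private boolean removeFromSomeDuplicatedStart() {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int start : pileup) counts.merge(start, 1, Integer::sum);
        for (int start : counts.descendingKeySet()) {
            if (counts.get(start) > 1) {
                pileup.remove(Integer.valueOf(start));        // drop one read at this start
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LevelingSketch sketch = new LevelingSketch(5);
        sketch.addIncoming(new ArrayList<>(Arrays.asList(100, 100, 100)));
        sketch.addIncoming(new ArrayList<>(Arrays.asList(101, 101, 101, 101)));
        sketch.addIncoming(new ArrayList<>(Arrays.asList(102, 102)));
        System.out.println(sketch.pileup);                    // stays at or below the target size
    }
}
```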

2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
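As an illustration, keeping each read independently with the requested probability behaves this way; the following is a rough, self-contained sketch of the idea (not the GATK's actual implementation):

```java
import java.util.*;

// Illustrative sketch of "downsample to fraction": each read is kept
// independently with probability equal to the requested fraction, which is
// what makes this mode an unbiased random sampling.
public class FractionDownsamplerSketch {

    public static <T> List<T> downsampleToFraction(List<T> reads, double fraction, Random rng) {
        List<T> kept = new ArrayList<>();
        for (T read : reads) {
            if (rng.nextDouble() < fraction) {   // roughly 'fraction' of the reads survive
                kept.add(read);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> reads = new ArrayList<>();
        for (int i = 0; i < 10000; i++) reads.add(i);
        // Roughly one quarter of the reads should remain, as with -dfrac 0.25.
        System.out.println(downsampleToFraction(reads, 0.25, new Random(1)).size());
    }
}
```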



Created 2016-04-22 14:11:11 | Updated 2016-04-22 14:17:51 | Tags: haplotypecaller downsampling memory

Comments (1)

Hi there,

I am working on mitochondrial genomes (250 samples) with a coverage of ~7000x. HaplotypeCaller (v3.5) runs perfectly for the first 248 samples, but on the last ones the following Java error appears:

ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.

All the other samples work perfectly with ~64G of mem and v_mem, so I tried 128G, but it is still not sufficient for my last samples to be processed. As you probably understand, I can't extend the memory indefinitely, and I would like to know whether there is a way to force the downsampling, or at least to greatly reduce the memory used by HaplotypeCaller.

Thanks for your time,

Regards,

Alex H


Created 2016-02-29 22:25:17 | Updated | Tags: downsampling mutect2

Comments (4)

According to MuTect2's docs, the default is to downsample by sample to 1,000x. I would like to lower this substantially to something like 3,000 or lower. When I use the -dcov argument in my command line I get:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: Cannot use -dcov or --downsample_to_coverage for ActiveRegionWalkers, use another downsampling argument
ERROR ------------------------------------------------------------------------------------------

Is there another way to adjust the downsampling in MuTect2?


Created 2015-11-11 15:45:08 | Updated | Tags: haplotypecaller downsampling ad dp read-counts

Comments (1)

What I have learned so far from other discussions about HaplotypeCaller:

  • read counts for positions with very high coverage are downsampled
  • this does not affect variant calling
  • this does affect DP and AD fields in the output (g)vcf file
  • reads are selected randomly
  • don't use -nct parameter with HC
  • downsampling is hard-coded and can't be influenced by parameters

Nonetheless, two problems remain. The HC doc says: "This tool applies the following downsampling settings by default. To coverage: 500." Why is it possible to observe much higher coverage (DP, AD) values in the output vcf file?

I observe SNPs where the recalibrated bam file in IGV has a depth of 1385 for the reference and 1233 for the alternate allele, but 839 (reference) and 246 (alt) in the HaplotypeCaller vcf file. Maybe this happens by chance, as reads for downsampling are chosen at random, or it is related to this bug: [gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller](http://gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller)

Both observations lead to the conclusion that DP and AD values from HC output are of little use for samples with high (where does high start? 500?) coverage.


Created 2015-10-12 08:22:52 | Updated | Tags: downsampling randomly

Comments (3)

Hello, as I saw from this link: http://gatkforums.broadinstitute.org/discussion/1237/how-unified-genotyper-works-retired downsampling selects a random subset of reads in UnifiedGenotyper, and the random seed values are affected by several things, like multithreading and the intervals run in each thread. I want to ask if there is a "tool" in GATK that could select a subset of reads non-randomly, I mean with some deliberate criteria.


Created 2015-10-09 09:31:41 | Updated | Tags: downsampling genotypegvcfs

Comments (2)

The documentation for GenotypeGVCFs states that it is employing downsampling:

Downsampling settings This tool applies the following downsampling settings by default.

Mode: BY_SAMPLE To coverage: 1,000

I also see this in my output :"INFO 14:51:47,705 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000"

Specifically, what is it downsampling: the genotype likelihoods per locus, or the active region? Is this behaviour changeable via -dcov, or would changing it be inadvisable?


Created 2015-07-23 11:26:42 | Updated | Tags: fisherstrand haplotypecaller downsampling strand-bias

Comments (15)

Hi GATK team, again thanks a lot for the wonderful tools you're offering to the community.

I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60. However I am facing an issue probably due to the downsampling done by HC.

I should have 5000 reads, but DP is around 500-600, which I understood is due to downsampling (even with -dt NONE). I did understand that it does not impact the calling itself. However, it is annoying me for two reasons: 1) calculating the frequency of the variant using the AD field is not correct (it is not based on all reads); 2) I get variants with FS > 60 whereas, when you look at the entire set of reads, there is absolutely no strand bias.

Example with this variant chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4

When I look at all the reads, I have the following counts, well balanced between the + and - strands: Allele G: 1389 (874+, 515-); Allele A: 1445 (886+, 559-).
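To make the comparison concrete, here is a small standalone sketch I put together (my own code, not GATK's implementation; the class and method names, and the interpretation of the SB field as ref+/ref-/alt+/alt-, are my own assumptions) that phred-scales a two-sided Fisher's exact test on the downsampled strand counts versus the full strand counts:

```java
// Standalone sketch (not GATK code): phred-scaled two-sided Fisher's exact
// test on a 2x2 strand table, roughly the quantity the FS annotation reports.
public class StrandBiasCheck {

    // log(n!) computed by summing logs; good enough for a few thousand reads.
    static double logFactorial(int n) {
        double s = 0.0;
        for (int i = 2; i <= n; i++) s += Math.log(i);
        return s;
    }

    // log of the hypergeometric probability of a 2x2 table with fixed margins.
    static double logTableProb(int a, int b, int c, int d) {
        return logFactorial(a + b) + logFactorial(c + d)
             + logFactorial(a + c) + logFactorial(b + d)
             - logFactorial(a) - logFactorial(b) - logFactorial(c) - logFactorial(d)
             - logFactorial(a + b + c + d);
    }

    // Two-sided Fisher's exact p-value: sum the probabilities of all tables
    // with the same margins that are no more likely than the observed one.
    static double fisherTwoSided(int a, int b, int c, int d) {
        double logObs = logTableProb(a, b, c, d);
        int row1 = a + b, col1 = a + c, n = a + b + c + d;
        int lo = Math.max(0, row1 + col1 - n), hi = Math.min(row1, col1);
        double p = 0.0;
        for (int x = lo; x <= hi; x++) {
            double lp = logTableProb(x, row1 - x, col1 - x, n - row1 - col1 + x);
            if (lp <= logObs + 1e-7) p += Math.exp(lp);
        }
        return Math.min(1.0, p);
    }

    // Phred-scale the p-value, in the style of the FS annotation.
    static double fs(int refFwd, int refRev, int altFwd, int altRev) {
        double p = fisherTwoSided(refFwd, refRev, altFwd, altRev);
        return -10.0 * Math.log10(Math.max(p, Double.MIN_VALUE));
    }

    public static void main(String[] args) {
        // Downsampled SB counts from the VCF record above (taken as ref+/ref-/alt+/alt-).
        System.out.println("FS on downsampled counts: " + fs(424, 151, 254, 4));
        // Full counts from all reads: ref 874+/515-, alt 886+/559-.
        System.out.println("FS on full counts:        " + fs(874, 515, 886, 559));
    }
}
```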

Could you please tell me how to avoid such an issue? (By the way, this variant is a true one and should not be filtered out.)

Thanks a lot.


Created 2014-01-27 10:00:01 | Updated | Tags: printreads downsampling

Comments (5)

Hello,

I ran PrintReads with the -ds option and got the error message shown in the title. Looking into the docs, I see that there are other options for downsampling under CommandLineGATK, like downsample to coverage, downsample to fraction, and downsampling type. Am I right to think that the previous option -ds is equivalent to -dfrac combined with -dt ALL_READS?

Best regards, Jacek


Created 2013-08-29 01:11:12 | Updated | Tags: depthofcoverage haplotypecaller downsampling

Comments (6)

Hello!

I'm trying to downsample in an orderly fashion in the name of experimentation, and in doing so would like to restrict the experiment to just one chromosome - so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up:

ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases.

I was hoping to poke through results from using HaplotypeCaller with many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate, though I was hoping to find out what level of coverage led to what level of results, and using -dfrac feels much less specific, as it appears to toss a fraction of however many reads were at a given position, rather than tossing reads over a certain coverage. Thus with -dfrac I could say that my sample had an average of 30x for this chromosome and I tossed half, so theoretically I've simulated 15x depth of coverage...

Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction?

Thanks for any help/discussion! -Tristan


Created 2013-08-13 18:41:40 | Updated | Tags: haplotypecaller downsample downsampling

Comments (28)

Hello,

I'm running the Haplotype Caller on 80 bams, multithreaded (-nt 14), with either vcfs or bed files as the intervals file. I am trying to downsample to approximately 200 reads (-dcov 200). However, my output vcfs clearly have much higher depth (over 300-500 reads) in some areas. Also, HC drastically slows down whenever it hits a region of very deep pile-ups (over 1000 reads).

Is there any way to speed up HC on regions of very deep pileups? Is there a way to make downsampling stricter?

Thank you for your help. Elise


Created 2013-05-14 04:45:55 | Updated | Tags: downsampling

Comments (2)

I have noticed some strange structure in the samples produced by the LevelingDownSampler class and was hoping someone might be nice enough to explain what is going on and how to avoid it.

I was attempting to use the unified genotyper to infer frequencies of heteroplasmic mutations in mitochondrial DNA by setting a high ploidy level to account for the multiplicity of mtDNA in a sample. However, when test datasets with different mitochondrial sequences were combined at known frequencies, the frequency inferred by the GATK (alleles at different levels of ploidy) was consistently different from the known mixing frequency. In other words, the recovered frequencies were biased.

Since the bias was reduced as the dcov parameter was increased, I looked at the downsampler class to check for issues. It appeared that the levelGroups function was not randomly sampling reads, as there was a lot of structure in the groups it was leveling as well as in the way it chose to level them.

I created a toy dataset by placing a SNP in a 300 bp window with high coverage and then merging two BAM files with samtools. At the end of the GATK analysis, I had the class output the number of reads that came from each original alignment position and from each file (i.e., the reads returned by the consumeFinalizedItems method). The two issues were:

First, the downsampler appeared to sample too many reads at the start of the genome segment before leveling off to the correct level, and then dropped again at the end of the segment being sequenced (though plenty of reads were available). Is it possible to avoid this?

Second, although in the middle of the genomic segment the read counts were uniformly distributed, their source in the BAM file was not. At low downsampling settings (<75) there would be bursts where one section of reads came entirely from one of the source bam files, followed by another burst where all the reads came from a different bam file (though these files were combined in the merge). Does anyone know whether spatial structure in the BAM file will appear as spatial structure in the downsampled reads? Is there any way to avoid this behavior?


Created 2012-11-13 16:53:49 | Updated 2012-11-19 20:07:33 | Tags: unifiedgenotyper downsampling

Comments (12)

I ran the same sample through a pipeline using GATK twice and received different variants, and I am trying to understand the reason behind this. My samples are from a MiSeq/capture kit run, and downsampling could be one reason (given that in one scenario the variant is called and in the other it isn't); the variant is called at 32% when I look into the .bam files.

As I understand it, UnifiedGenotyper randomly downsamples my dataset to 250, so I played around with the -dcov parameter:

  • same sample run twice: the 1st run reports a variant; the 2nd run doesn't.
  • up -dcov to 1000: neither run reports the variant.
  • up -dcov to 10,000: the 1st run again reports the variant; the 2nd run doesn't.
  • set -dt NONE: both runs call the variant

But setting -dt to NONE could be computationally expensive for a big sample set. Is there an identifiable reason why this is happening..?

Curious..!


Created 2012-10-17 22:19:16 | Updated 2012-10-17 22:23:41 | Tags: unifiedgenotyper downsample downsampling

Comments (3)

I haven't been using GATK for long, but I assumed that the downsample_to_coverage feature wouldn't ever be a cause for concern. I just tried running UnifiedGenotyper with -dcov set at 500, 5,000, and 50,000 on the same 1-sample BAM file. One would expect the results to be similar. However, 500 yielded 26 variants, 5,000 yielded 13, and 50,000 yielded just 1. The depth of that one variant was about 1,300 with the 50,000 cutoff. Why are the results so different?

Most of the other variants in the biggest set were cut off at 500, so some reads were filtered. A few of them are at relatively low frequency, but most are at 25% or higher. If they were appearing by chance, they should not be at such high frequencies.

In addition, there are some variants that are below 500, so they should not be affected by the cutoff. Why are those showing up with the low cutoff and not the higher cutoff?

I am using GATK 2.1-8. I am looking at a single gene only, so that is why there are so few variants and such high coverage.


Created 2012-09-19 15:45:44 | Updated 2013-01-07 20:39:33 | Tags: unifiedgenotyper variantannotator downsampling

Comments (4)

I am trying to understand how VariantAnnotator functions. I took the vcf file from the output of UnifiedGenotyper and ran VariantAnnotator with the same .bam and .bed files I used for running UnifiedGenotyper. I expected that all the calculations in the INFO field would remain the same, since I am using the same input files. However, I find that some fields have different values. Here is an example: UnifiedGenotyper output:

chr22   30094366        .       G       A       172.01  .       AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=244;DS;Dels=0.00;FS=0.000;HaplotypeScore=118.5897;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQRankSum=-0.049;QD=0.70;ReadPosRankSum=1.428;SB=-6.201e+01        GT:AD:DP:GQ:PL  0/1:220,19:244:99:202,0,2693 

VariantAnnotator output:

chr22   30094366        .       G       A       172.01  .       ABHet=0.923;AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=993;DS;Dels=0.00;FS=0.000;HaplotypeScore=454.8386;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQ0Fraction=0.0000;MQRankSum=-0.378;OND=0.034;QD=0.17;ReadPosRankSum=-4.859;SB=-6.201e+01      GT:AD:DP:GQ:PL  0/1:220,19:244:99:202,0,2693

I am running GATKLite 2.1. Notice that the DP in the INFO field has a different value. HaplotypeScore, QD, MQRankSum, etc. have different values as well. Am I doing anything wrong? Shouldn't these values be the same when I recalculate these fields using VariantAnnotator?