## Downsampling

Posted in Dictionary on 2012-08-11 05:16:06 | Last updated on 2015-12-19 10:53:18

#### Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Even in otherwise well-behaved datasets, normal sequencing and alignment protocols often yield pileups with vast numbers of reads aligned to a single section of the genome. Because these 'speed bumps' are so common, the GATK downsamples pileup data by default; this behavior can be explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism, which is mostly intended for testing how different overall mean coverage levels affect analysis results.

See below for details of how this is implemented and controlled in GATK.

## 1. Downsampling to a target coverage

### Defaults

The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

• The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
• The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
• The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.
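
To illustrate what per-sample independence means in practice, here is a minimal Python sketch. It is not GATK code: the function name `cap_per_sample` and the `(sample, read)` input format are assumptions made for illustration, and it uses plain random sampling rather than the position-aware strategy described under "Algorithm details". It only shows that each sample is capped on its own, so deep coverage in one sample does not reduce the reads kept for another.

```python
import random
from collections import defaultdict

DEFAULT_TARGET = 1000  # per-sample default quoted in the text


def cap_per_sample(reads, target=DEFAULT_TARGET, rng=random):
    """Cap each sample's reads at `target`, independently of other samples.

    `reads` is an iterable of (sample_name, read) pairs (hypothetical format).
    """
    by_sample = defaultdict(list)
    for sample, read in reads:
        by_sample[sample].append(read)

    capped = {}
    for sample, sample_reads in by_sample.items():
        if len(sample_reads) <= target:
            # Shallow samples are left untouched.
            capped[sample] = sample_reads
        else:
            # Deep samples are randomly reduced to the target depth.
            capped[sample] = rng.sample(sample_reads, target)
    return capped
```

With 5000 reads from one sample and 300 from another, the first is reduced to 1000 reads while the second keeps all 300.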

### Customizing

From the command line:

• To disable the downsampler, specify -dt NONE.
• To change the default per-sample coverage, pass the desired coverage to the -dcov option.

To modify the walker's default behavior:

• Add the @Downsample annotation to the top of your walker. Override the downsampling type by changing by=<value>. Override the downsampling depth by changing toCoverage=<value>.

### Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

• Select reads with the next alignment start.
• While the number of existing reads + the number of incoming reads is greater than the target sample size: walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

• If we have n slots available where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
• Otherwise, we have zero slots available. Choose the read from the existing pileup with the smallest alignment start. Throw it out and add one randomly selected read from the incoming reads.
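
The outline above can be sketched in Python for a single sample. This is one reading of the steps, not the GATK implementation: the function name `add_reads_at_next_start`, the dictionary-based pileup, and the rule that thinning stops once no group holds more than one read are all assumptions made for illustration.

```python
import random


def add_reads_at_next_start(pileup, start, incoming, target, rng=random):
    """Merge `incoming` reads (all starting at `start`) into one sample's pileup.

    `pileup` maps alignment start -> list of reads already kept for this sample.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    if not incoming:
        return

    def kept():
        return sum(len(group) for group in pileup.values())

    # While existing + incoming reads exceed the target, walk backward over the
    # groups of reads sharing an alignment start and drop one random read from
    # every group that still holds more than one.
    while kept() + len(incoming) > target:
        removed_any = False
        for pos in sorted(pileup, reverse=True):
            group = pileup[pos]
            if len(group) > 1:
                group.pop(rng.randrange(len(group)))
                removed_any = True
        if not removed_any:
            # Assumption: stop once nothing more can be thinned.
            break

    slots = target - kept()
    if slots >= 1:
        # Fill the free slots with randomly chosen incoming reads.
        chosen = rng.sample(incoming, min(slots, len(incoming)))
    else:
        # No free slot: evict a read at the smallest alignment start and
        # admit one randomly chosen incoming read in its place.
        oldest = min(pileup)
        group = pileup[oldest]
        group.pop(rng.randrange(len(group)))
        if not group:
            del pileup[oldest]
        chosen = [rng.choice(incoming)]
    pileup.setdefault(start, []).extend(chosen)
```

Calling this function once per alignment start, in increasing order, keeps the number of stored reads at or below `target`, so memory tracks the sampled depth rather than the full depth.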

## 2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
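
As a rough illustration of fractional downsampling, the Python sketch below assumes each read is retained independently with probability equal to the requested fraction. That retention rule and the function name `downsample_to_fraction` are assumptions for illustration, not a description of GATK internals.

```python
import random


def downsample_to_fraction(reads, fraction, rng=random):
    """Keep each read independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # rng.random() returns a value in [0, 1), so fraction=1.0 keeps every
    # read and fraction=0.0 keeps none.
    return [read for read in reads if rng.random() < fraction]
```

Under this assumption, the expected coverage after downsampling is the original coverage multiplied by the fraction, so a 40x dataset processed with a fraction of 0.25 simulates roughly 10x.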