Managing and filtering reads

From GSA

Jump to: navigation, search

Contents

Accessing reads: AlignmentContext and ReadBackedPileup

The AlignmentContext and ReadBackedPileup work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.

What are read backed pileups?

Read backed pileups are objects that contain all of the reads and their offsets that "pile up" at a locus on the genome (See figure). They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools like the recalibrator and SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of these approaches. Version two of the ReadBackedPileup presents a consistent and clean interface for working pileup data, as well as supporting the iterable() interface to enable the convenient for ( PileupElement p : pileup ) for-each loop support.

How do I get a ReadBackedPileup and how do I create them?

  • The best way is to simply grab the pileup (the underlying representation of the locus data) from your AlignmentContext object in map:
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context)
        ReadBackedPileup pileup = context.getPileup();

This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.

  • If you are trying to create your own, the best constructor is:
    public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )

requiring only a list, in order of read / offset in the pileup, of PileupElements.

  • From List<SAMRecord> and List<Offset>

If you happen to have lists of SAMRecords and integer offsets into them you can construct a ReadBackedPileup:

    public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )
 

What's the best way to use them?

Best way if you just need reads, bases and quals

for ( PileupElement p : pileup ) {
  System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
  // you can get the read itself too using p.getRead()
}

Is the most efficient way to get data, and should be used whenever possible.

I just want a vector of bases and quals

You can use:

    public byte[] getBases()
    public byte[] getSecondaryBases()
    public byte[] getQuals()

To get the bases and quals as a byte[] array, which is the underlying base representation in the SAM-JDK.

All I care about are counts of bases

Use the follow function to get counts of A, C, G, T in order

    public int[] getBaseCounts()

Which returns a int[4] vector with counts according to BaseUtils.simpleBaseToBaseIndex for each base.

Can I view just the reads for a given sample, read group, or any other arbitrary filter?

The GATK can very efficiently stratify pileups by sample, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:

pileup.getSamples();
pileup.getPileupForSample(String sampleName);

In addition to the rich set of filtering primitives built into the ReadBackedPileup, you can supply your own primitives by implmenting a PileupElementFilter:

public interface PileupElementFilter {
    public boolean allow(final PileupElement pileupElement);
}

and passing it to ReadBackedPileup's generic filter function:

public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);

See the ReadBackedPileup's javadoc for a complete list of built-in filtering primitives.

Historical: StratifiedAlignmentContext

While ReadBackedPileup is the preferred mechanism for aligned reads, some walkers still use the StratifiedAlignmentContext to carve up selections of reads. If you find functions that you require in StratifiedAlignmentContext that seem to have no analog in ReadBackedPileup, please let us know and we'll port the required functions for you.

Sampling and filtering reads

Reads can be filtered out of traversals by either pileup size through one of our downsampling methods or by read property through our read filtering mechanism. Both techniques and described below.

Downsampling

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Defaults

The GATK's default downsampler exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.
  • To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

  • Add the @Downsample interface to the top of your walker. Override the downsampling type by changing the by=<value>. Override the downsampling depth by changing the toCoverage=<value>.

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

  • Select <sample size> reads with the next alignment start.
  • While the number of existing reads + the number of incoming reads is greater than the target sample size:
    • Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.
  • If we have n slots avaiable where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
  • Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.

Read filtering

To selectively filter out reads before they reach your walker, implement one or multiple net.sf.picard.filter.SamRecordFilter, and attach it to your walker as follows:

@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})

Command-line arguments for read filters

You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:

public class MaxReadLengthFilter implements SamRecordFilter {
    @Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
    private int maxReadLength;
    
    public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}

Adding this filter to the top of your walker using the @ReadFilters attribute will add a new required command-line argument, maxReadLength, which will filter reads > maxReadLength before your walker is called.

Adding filters dynamically using command-line arguments

The --read-filter argument will allow you to apply whatever read filters you'd like to your dataset, before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add the command line parameters:

--read_filter MaxReadLength --maxReadLength 76

You can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
Personal tools