Managing walker data presentation and control flow
The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented.
Locus walkers
Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.
Switching between covered and uncovered loci
The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:
@By(DataSource.REFERENCE)
@By(DataSource.READS)
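As a rough illustration of how such an attribute sits on a walker class, the sketch below defines its own stand-in `By` annotation and `DataSource` enum so that it compiles without the GATK on the classpath. The walker class names and the `traversalSource` helper are hypothetical, and the mapping of `REFERENCE` to "all loci" and `READS` to "covered loci" is an assumption based on the order in which the two attributes are listed above.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-ins for illustration only; the real @By and DataSource ship with the GATK.
class ByAnnotationSketch {
    enum DataSource { READS, REFERENCE }

    @Retention(RetentionPolicy.RUNTIME)
    @interface By { DataSource value(); }

    // Hypothetical walker that asks to be driven by the reference (assumed: all loci).
    @By(DataSource.REFERENCE)
    static class AllLociWalker {}

    // Hypothetical walker that asks to be driven by the reads (assumed: covered loci only).
    @By(DataSource.READS)
    static class CoveredLociWalker {}

    // Hypothetical helper: looks up which data source a walker class declared.
    static DataSource traversalSource(Class<?> walkerClass) {
        By by = walkerClass.getAnnotation(By.class);
        if (by == null) {
            throw new IllegalArgumentException("No @By on " + walkerClass.getName());
        }
        return by.value();
    }

    public static void main(String[] args) {
        System.out.println(traversalSource(AllLociWalker.class));
        System.out.println(traversalSource(CoveredLociWalker.class));
    }
}
```

In a real walker only the single `@By(...)` line above the class declaration is needed; the reflection lookup is shown only to make the effect of the annotation visible.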
Filtering defaults
By default, the following filters are automatically added to every locus walker.
- Reads with nonsensical alignments
- Unmapped reads
- Non-primary alignments
- Duplicate reads
- Reads failing vendor quality checks
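Four of these five filters correspond to bits in the FLAG field defined by the SAM specification, so their combined effect can be sketched as a bit mask. This is a minimal sketch and not the GATK's own filter classes: the class and method names are hypothetical, and the "nonsensical alignments" filter is left out because it cannot be expressed as a single flag bit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: approximates the flag-based default filters with SAM FLAG bits.
class DefaultLocusFilterSketch {
    // Bit values from the SAM specification's FLAG field.
    static final int UNMAPPED = 0x4;
    static final int NOT_PRIMARY = 0x100;
    static final int FAILS_VENDOR_QC = 0x200;
    static final int DUPLICATE = 0x400;

    // A read passes when none of the rejected bits is set in its flag.
    static boolean passes(int flag) {
        int rejected = UNMAPPED | NOT_PRIMARY | FAILS_VENDOR_QC | DUPLICATE;
        return (flag & rejected) == 0;
    }

    // Keeps only the flags of reads that would reach the walker.
    static List<Integer> keep(List<Integer> flags) {
        List<Integer> kept = new ArrayList<>();
        for (int flag : flags) {
            if (passes(flag)) {
                kept.add(flag);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> flags = new ArrayList<>();
        flags.add(99);    // paired, properly aligned, first in pair
        flags.add(4);     // unmapped
        flags.add(1123);  // same as 99 but marked duplicate
        System.out.println(keep(flags));
    }
}
```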
ROD walkers
Walk over the data set one location at a time, but only those locations covered by reference-ordered data. Really a special case of locus walkers. ROD walkers are read-free traversals that operate over reference-ordered data (ROD) and the reference genome at sites where there is ROD information. They are geared toward high-performance traversal of many RODs and the reference, as in VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalker<M,T> traversals, with the following few quirks.
Differences from a RefWalker
- RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.
- Because of this skipping, RODWalkers receive a context object that reports the number of reference bases skipped between map calls:
nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location
In order to get the final count of skipped bases at the end of an interval (or chromosome) the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.
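The bookkeeping described above can be checked with a small simulation. This is a sketch under stated assumptions, not GATK code: `countSites` and `Counter` are hypothetical names, ROD positions are assumed to be sorted and to lie inside the interval, and the final end-of-interval call is modelled by passing `atRod = false` in place of the null ReferenceContext.

```java
// Illustrative simulation of how skipped bases add up to the interval length.
class SkippedBasesSketch {
    static final class Counter {
        long nSites = 0;

        // Mirrors the counting pattern: always add the skipped bases, and add one
        // more for the current site unless this is the final end-of-interval call.
        void map(boolean atRod, long skippedBases) {
            nSites += skippedBases;
            if (!atRod) {
                return; // final call: no current site to count
            }
            nSites++;
        }
    }

    // Simulates the traversal of [intervalStart, intervalEnd] (1-based, inclusive).
    static long countSites(long intervalStart, long intervalEnd, long[] rodPositions) {
        Counter counter = new Counter();
        long previous = intervalStart - 1; // last position already accounted for
        for (long position : rodPositions) {
            counter.map(true, position - previous - 1);
            previous = position;
        }
        // Final call: bases between the last ROD and the end of the interval.
        counter.map(false, intervalEnd - previous);
        return counter.nSites;
    }

    public static void main(String[] args) {
        System.out.println(countSites(1, 100, new long[] {10, 11, 50}));
    }
}
```

Whatever the ROD positions are, the total should equal the number of bases in the interval, which is why the final call without a current site is needed.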
Filtering defaults
ROD walkers inherit the same filters as locus walkers:
- Reads with nonsensical alignments
- Unmapped reads
- Non-primary alignments
- Duplicate reads
- Reads failing vendor quality checks
Example change over of VariantEval
Changing to a RodWalker is very easy. Here's the new top of VariantEval, switched from its old RefWalker state to a RodWalker:
//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {
The map function must now capture the number of skipped bases and protect itself from the final interval map calls:
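public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    // Count the bases skipped since the previous map call.
    nMappedSites += context.getSkippedBases();
    if ( ref == null ) { // we are seeing the last site
        return 0;
    }
    // Count the current site itself.
    nMappedSites++;
    // ... the rest of the original map body is unchanged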
That's it.
Performance improvements
A ROD walker can be much faster than a RefWalker when the RODs are sparse:
| ROD data | RODWalker | RefWalker |
|---|---|---|
| dbSNP and 1KG Pilot 2 SNP calls on chr1 | 164u (s) | 768u (s) |
| Just 1KG Pilot 2 SNP calls on chr1 | 54u (s) | 666u (s) |
Read walkers
Walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.
Filtering defaults
By default, the following filters are automatically added to every read walker.
- Reads with nonsensical alignments
Read pair walkers
Walk over a queryname-sorted BAM, presenting each mate and its pair. No reference bases or reference-ordered data are presented.
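Queryname sorting matters because it places the two mates of a pair next to each other, so pairs can be formed in a single pass. The sketch below illustrates that idea only. The `Read` class and the `pairs` method are hypothetical, and skipping reads whose mate is missing is an assumption, not documented GATK behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: forms mate pairs from a queryname-sorted list of reads.
class ReadPairSketch {
    // Hypothetical minimal read: a query name plus a label for which mate it is.
    static final class Read {
        final String name;
        final String mate;

        Read(String name, String mate) {
            this.name = name;
            this.mate = mate;
        }
    }

    // Pairs adjacent reads that share a query name; reads without a mate are skipped.
    static List<Read[]> pairs(List<Read> sortedByName) {
        List<Read[]> result = new ArrayList<>();
        int i = 0;
        while (i + 1 < sortedByName.size()) {
            Read first = sortedByName.get(i);
            Read second = sortedByName.get(i + 1);
            if (first.name.equals(second.name)) {
                result.add(new Read[] {first, second});
                i += 2;
            } else {
                i += 1; // no adjacent mate, move on (assumed handling)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Read> reads = new ArrayList<>();
        reads.add(new Read("r1", "first"));
        reads.add(new Read("r1", "second"));
        reads.add(new Read("r2", "first"));
        reads.add(new Read("r3", "first"));
        reads.add(new Read("r3", "second"));
        for (Read[] pair : pairs(reads)) {
            System.out.println(pair[0].name + ": " + pair[0].mate + " + " + pair[1].mate);
        }
    }
}
```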
Filtering defaults
By default, the following filters are automatically added to every read pair walker.
- Reads with nonsensical alignments
Duplicate walkers
Walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.
Filtering defaults
By default, the following filters are automatically added to every duplicate walker.
- Reads with nonsensical alignments
- Unmapped reads
- Non-primary alignments