Tribble
Overview
The Tribble project began as an effort to overhaul our reference-ordered data system. We had many different formats shoehorned into a common framework that didn't work as intended. What we wanted was a common framework that allowed searching of reference-ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, and these were incorporated into the Tribble library.
Architecture Overview
Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while still allowing iteration over known feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes a codec, an index file, and a file containing the features to be processed. With an instance of BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.
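To make the query semantics concrete, here is a minimal sketch of what "features that span a specific location" means for closed, one-based coordinates. It does not use Tribble's API: `Interval` and `OverlapQuery` are hypothetical names invented for this illustration, and the linear scan stands in for the index lookup that a real reader would perform.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feature record; not a Tribble class.
final class Interval {
    final String chr;
    final int start; // 1-based, inclusive
    final int end;   // 1-based, inclusive

    Interval(String chr, int start, int end) {
        this.chr = chr;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical helper that answers a location query by scanning every record.
final class OverlapQuery {
    /** Returns the intervals on chr that overlap the closed range [start, end]. */
    static List<Interval> query(List<Interval> all, String chr, int start, int end) {
        List<Interval> hits = new ArrayList<>();
        for (Interval f : all) {
            // Two closed intervals overlap when each one starts no later than the other ends.
            if (f.chr.equals(chr) && f.start <= end && f.end >= start) {
                hits.add(f);
            }
        }
        return hits;
    }
}
```

Iterating over all records corresponds to walking the list directly; the query differs only in the overlap filter. An index makes the same query cheaper by skipping records that cannot overlap.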
Developer Overview
For developers there are two classes that are important:
- Feature. This is the genomically oriented feature that represents the underlying data in the input file. For instance, in the VCF format this is the variant call, including quality information, the reference base, and the alternate base. Implementing a feature requires the chromosome name, the start position, and the stop position. The start and stop positions represent a closed, one-based interval; i.e. the first base in chromosome one would be chr1:1-1.
- FeatureCodec. This class takes in a line of text (from an input source, whether it's a file, a compressed file, or an HTTP link) and produces the above feature.
To add a new format to Tribble, you need to implement the two classes above (in an appropriately named subfolder of the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:
public interface Feature {
    /**
     * Return the feature's reference sequence name, e.g. chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions. The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}
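As an illustration of the interface above, a minimal in-memory implementation might look like the following sketch. `SimpleFeature` and its `length()` helper are hypothetical names, not part of Tribble; the `Feature` interface is copied locally so the sketch compiles on its own, and the argument check is an assumption of this sketch rather than a Tribble requirement.

```java
// Local copy of the interface shown above so the sketch compiles on its own.
interface Feature {
    String getChr();
    int getStart();
    int getEnd();
}

// Hypothetical minimal implementation; it holds only in-memory data and
// knows nothing about the file format the values came from.
final class SimpleFeature implements Feature {
    private final String chr;
    private final int start;
    private final int end;

    SimpleFeature(String chr, int start, int end) {
        // Assumption of this sketch: reject coordinates that are not a valid
        // 1-based closed interval.
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("invalid interval: " + start + "-" + end);
        }
        this.chr = chr;
        this.start = start;
        this.end = end;
    }

    @Override
    public String getChr() { return chr; }

    @Override
    public int getStart() { return start; }

    @Override
    public int getEnd() { return end; }

    /** Length under the closed-interval convention: end - start + 1. */
    int length() { return end - start + 1; }
}
```

With this sketch, `new SimpleFeature("chr1", 1, 1)` represents the first base of chromosome one, and its `length()` is 1.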
And the interface for FeatureCodec:
/**
 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {
    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return Return the Feature encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public T decode(String line);

    /**
     * This function returns the object the codec generates. This is allowed to be Feature in the case where
     * conditionally different types are generated. Be as specific as you can though.
     *
     * This function is used by reflection-based tools, so we can know the underlying type
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();

    /**
     * Read and return the header, or null if there is no header.
     *
     * @return header object
     */
    public Object readHeader(LineReader reader);
}
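The sketch below shows how a codec for a BED-like format could satisfy this contract. It is not Tribble's own BED codec: `BedFeature` and `SketchBedCodec` are hypothetical names, the interfaces are simplified local copies, and `readHeader` is omitted because `LineReader` is a Tribble type not reproduced here. The point worth noting is the coordinate conversion: BED intervals are zero-based and half-open, so the start is shifted by one while the end is kept as-is to obtain the closed, one-based interval a Feature expects.

```java
// Simplified local copies of the interfaces above so the sketch compiles on its own.
interface Feature {
    String getChr();
    int getStart();
    int getEnd();
}

interface FeatureCodec<T extends Feature> {
    Feature decodeLoc(String line);
    T decode(String line);
    Class<T> getFeatureType();
}

// Hypothetical in-memory feature for a BED-like record.
final class BedFeature implements Feature {
    private final String chr;
    private final int start;
    private final int end;
    private final String name;

    BedFeature(String chr, int start, int end, String name) {
        this.chr = chr;
        this.start = start;
        this.end = end;
        this.name = name;
    }

    public String getChr() { return chr; }
    public int getStart() { return start; }
    public int getEnd() { return end; }
    public String getName() { return name; }
}

// Hypothetical codec: one whitespace-delimited line in, one feature out.
final class SketchBedCodec implements FeatureCodec<BedFeature> {
    public Feature decodeLoc(String line) {
        // The location is cheap to obtain here, so reuse the full decode.
        return decode(line);
    }

    public BedFeature decode(String line) {
        String trimmed = line.trim();
        // Comments, blank lines, and track/browser lines are not features.
        if (trimmed.isEmpty() || trimmed.startsWith("#")
                || trimmed.startsWith("track") || trimmed.startsWith("browser")) {
            return null;
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 3) {
            return null;
        }
        int start = Integer.parseInt(fields[1]) + 1; // 0-based start -> 1-based start
        int end = Integer.parseInt(fields[2]);       // exclusive end == 1-based inclusive end
        String name = fields.length > 3 ? fields[3] : "";
        return new BedFeature(fields[0], start, end, name);
    }

    public Class<BedFeature> getFeatureType() {
        return BedFeature.class;
    }
}
```

Under these assumptions, the BED line `chr1 0 1 first` (tab-separated) decodes to a feature at chr1:1-1, while a line beginning with `#` yields null.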
Supported Formats
The following formats are supported in Tribble:
- VCF Format
- DbSNP Format
- BED Format
- GATK Interval Format
