Tagged with #rod
1 documentation article | 0 announcements | 0 forum discussions


Comments (2)

Brief introduction to reference metadata (RMDs)

Note that the -B flag referred to below is deprecated; these docs need to be updated

The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files is that:

  • They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.

  • The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.

  • The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.

Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.

How do I get reference metadata files into my walker?

RMD files are extremely easy to get into the GATK using the -B syntax:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf

In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.

In general, you can provide as many RMD bindings to the GATK as you like:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf

Works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.

1. Directly getting access to a single named track

In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:

public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}

2. Grabbing anything that's convertable to a VariantContext

From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.

Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);

3. Looking at all of the RMDs

Here's a totally general code snippet from PileupWalker.java. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.

private String getReferenceOrderedData( RefMetaDataTracker tracker ) { ArrayList rodStrings = new ArrayList(); for ( GATKFeature datum : tracker.getAllRods() ) { if ( datum != null && ! (datum.getUnderlyingObject() instanceof DbSNPFeature)) { rodStrings.add(((ReferenceOrderedDatum)datum.getUnderlyingObject()).toSimpleString()); // TODO: Aaron: this line still survives, try to remove it } } String rodString = Utils.join(", ", rodStrings);

        DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);

        if ( dbsnp != null)
            rodString += DbSNPHelper.toMediumString(dbsnp);

        if ( !rodString.equals("") )
            rodString = "[ROD: " + rodString + "]";

        return rodString;
}

How do I write my own RMD types?

Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.

Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec class. Name resolution occurs using the -B type parameter, i.e. if the user specified:

-B:calls1,VCF calls1.vcf

The GATK looks for a FeatureCodec called VCFCodec.java to decode the record type. Alternately, if the user specified:

-B:calls1,MYAwesomeFormat calls1.maft

THe GATK would look for a codec called MYAwesomeFormatCodec.java. This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.

No posts found with the requested search criteria.
No posts found with the requested search criteria.