GATKReport
The GATKReport output file is a single text file containing multiple tables, formatted to be both human-readable and computationally accessible in languages like R.
GATK Report
A data structure that allows data to be collected over the course of a walker's computation, then have that data written to a PrintStream such that it's human-readable, AWK-able, and R-friendly (given that you load it using the GATKReport loader module).
The goal of this object is to use the same data structure for both accumulating data during a walker's computation and emitting that data to a file for easy analysis in R (or any other program/language that can take in a table of results). Thus, all of the infrastructure below is designed simply to make printing the following as easy as possible.
Below is an example table:
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7     qualavg.61PA8.7
0      7.451835696110506E-3  25.474613284804366
1      2.362777171937477E-3  29.844949954504095
2      9.087604507451836E-4  32.875909752547310
3      5.452562704471102E-4  34.498999090081895
4      9.087604507451836E-4  35.148316651501370
5      5.452562704471102E-4  36.072234352256190
6      5.452562704471102E-4  36.121724890829700
7      5.452562704471102E-4  36.191048034934500
8      5.452562704471102E-4  36.003457059679770

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key     column
1:1000  T
1:1001  A
1:1002  C
Here, we have a GATKReport: a well-formatted, easy-to-read representation of some tabular data. It begins with the report header and version. This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names, followed by the data. Any object can be accepted as data, unless the column is typed. In that case, the data must match the column type.
Usages
Simple GATK Report
The simple GATK report is an easy way to collect data and output it to a file.
A simple GATK Report consists of the following:
- A single table
- No primary key (it is hidden)
- Optional:
- Only untyped columns. As long as the data is an Object, it will be accepted.
- Default column values being empty strings.
Limitations
- A simple GATK report cannot contain multiple tables.
- It cannot contain typed columns, which prevents arithmetic gathering (a feature not yet implemented).
Example
The following code creates a simple GATK Report, fills it with data, and prints the output to the console. In practice, the report can and should be printed to a PrintStream that writes to a file.
// Create a new simple GATK report named "TableName" with columns: Roger, is, and Awesome
GATKReport report = GATKReport.newSimpleReport("TableName", "Roger", "is", "Awesome");
// Add data to simple GATK report
report.addRow( 12, 23.45, true);
report.addRow("string", 'A', 24.5D);
report.addRow("NaN", "", 2342000L);
// Print the report to console
report.print(System.out);
Below is an example of a walker using the simple GATK Report.
@ActiveRegionExtension(extension=50)
public class CountReadsInActiveRegions extends ActiveRegionWalker<CountReadsInActiveRegions.Datum, GATKReport> {
    @Output
    @Gather(GATKReportGatherer.class)
    PrintStream out;
    public static class Datum {
        private final GenomeLoc activeRegionLoc;
        private final GenomeLoc extendedLoc;
        public final boolean isActive;
        public int nReads;
        public Datum(final GenomeLoc activeRegionLoc, final GenomeLoc extendedLoc, final boolean active, final int nReads) {
            this.activeRegionLoc = activeRegionLoc;
            this.extendedLoc = extendedLoc;
            isActive = active;
            this.nReads = nReads;
        }
    }
    boolean coinFlip = false;
    @Override
    public double isActive( final RefMetaDataTracker tracker, final ReferenceContext ref, final AlignmentContext context ) {
        if( GenomeAnalysisEngine.getRandomGenerator().nextDouble() > 0.9995 ) {
            coinFlip = !coinFlip;
        }
        return ( coinFlip ? 0.9995 : 0.0 );
    }
    @Override
    public Datum map( final ActiveRegion activeRegion, final RefMetaDataTracker tracker ) {
        return new Datum(activeRegion.getLocation(), activeRegion.getExtendedLoc(), activeRegion.isActive, activeRegion.size());
    }
    @Override
    public GATKReport reduceInit() {
        return GATKReport.newSimpleReport("CountReadsInActiveRegions", "loc", "extended.loc", "is.active", "n.reads");
    }
    @Override
    public GATKReport reduce( final Datum value, final GATKReport report ) {
        report.addRow(value.activeRegionLoc.toString(), value.extendedLoc.toString(), value.isActive, value.nReads);
        return report;
    }
    @Override
    public void onTraversalDone(final GATKReport report) {
        report.print(out);
    }
}
This code produces the following table:
#:GATKReport.v1.0:1
#:GATKTable:false:4:7:::::;
#:GATKTable:CountReadsInActiveRegions:A simplified GATK table report
loc                   extended.loc          is.active  n.reads
20:10017935-10018360  20:10017885-10018410  false      329
20:10018361-10018773  20:10018311-10018823  false      307
20:10077202-10077627  20:10077152-10077677  false      316
20:10077628-10077801  20:10077578-10077851  false      127
20:10077802-10078042  20:10077752-10078092  true       145
20:10096866-10097291  20:10096816-10097341  true       323
20:10097292-10097701  20:10097242-10097751  true       261
Full GATK Report
The full GATK Report offers much more functionality than the simple report but is more complex to set up.
Examples
Here, we create a GATK report with two tables. Typed columns have their default values and format strings specified when they are added.
// Create a new GATK report
GATKReport report1 = new GATKReport();
// Add a table with specified name and description
report1.addTable("TableName", "To contain some more data types");
// Retrieve the newly created table
GATKReportTable table = report1.getTable("TableName");
// Add a primary key that will be shown
table.addPrimaryKey("key", true);
// Each column here is typed and created with name, default value,
// a boolean for whether or not it will be displayed, and a format string
table.addColumn("SomeInt", 0, true, "%d");
table.addColumn("SomeFloat", 0.0, true, "%.16E");
// Fill in the data. Note that when the value for a certain key and column is not specified,
// the column's default value is used.
table.set("Bob", "SomeInt", 34);
table.set("Bob", "SomeFloat", 34.0);
table.set("Tim", "SomeInt", -1);
table.set("Rob", "SomeFloat", 0.000003);
table.set("Rob", "SomeInt", 99);
table.set("Roger", "SomeFloat", 1234.5);
// Create a second table
report1.addTable("Table2", "Description");
// Create a primary key that will be hidden
report1.getTable("Table2").addPrimaryKey("cycle", false);
// Create a typed column of type Decimal
report1.getTable("Table2").addColumn("Error Rate", 0.0, true, "%.4e");
// Create an untyped column named "Column" with "empty" as its default value
report1.getTable("Table2").addColumn("Column", "empty" );
// Fill in the data
report1.getTable("Table2").set(0, "Error Rate", 0.004353);
report1.getTable("Table2").set(1, "Error Rate", 0.013452);
report1.getTable("Table2").set(2, "Error Rate", 0.0);
report1.getTable("Table2").set(3, "Error Rate", 0.999);
// Print the report to console
report1.print(System.out);
GATK Report Gatherer
The GATK report now comes with functionality to gather reports for scatter-gather jobs (see Queue). This allows jobs to run in parallel and have their data combined at the end. Using the GATK Report Gatherer is very easy. The gatherer is currently limited to combining every row into one big report. More advanced gathering techniques can be added later by popular demand.
Examples
In your walker, simply add the @Gather annotation to your output report, as shown below.
@Output
@Gather(GATKReportGatherer.class)
PrintStream out;
For an example walker that uses the gatherer, see the simple GATK report example. The walker generates Scala scripts that allow Queue to scatter-gather those jobs. Consult Queue's documentation for how to use Scala scripts to scatter-gather your jobs.
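The row-combining behaviour described above can be pictured with a small standalone sketch. It is not the code of GATKReportGatherer: the class name RowGatherSketch, its gather method, and the representation of each scattered table as a list of lines (column header first, then rows) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the GATK codebase.
final class RowGatherSketch {
    // Each shard is one scattered table: its first line is the column header,
    // the remaining lines are data rows. Rows are appended in shard order.
    static List<String> gather(List<List<String>> shards) {
        List<String> combined = new ArrayList<>();
        String header = null;
        for (List<String> shard : shards) {
            if (shard.isEmpty()) {
                continue; // nothing to contribute
            }
            String shardHeader = shard.get(0);
            if (header == null) {
                header = shardHeader;
                combined.add(header);
            } else if (!header.equals(shardHeader)) {
                throw new IllegalArgumentException("Column headers differ: " + shardHeader);
            }
            combined.addAll(shard.subList(1, shard.size()));
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("loc n.reads", "20:100-200 329");
        List<String> second = Arrays.asList("loc n.reads", "20:201-300 307");
        for (String line : gather(Arrays.asList(first, second))) {
            System.out.println(line);
        }
    }
}
```

Rejecting shards whose column headers differ is a design choice of this sketch; the document does not say how the real gatherer handles mismatched tables.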
Format String
In the GATK report, every column can have a format specified by a format string. This string is applied to the data using the String.format() function. The format string also dictates the column type. A format string is not required, but when one is used, the column has a column type, which will enable arithmetic gathering in the future.
Examples
%.8f
This will display a numeric object with 8 digit decimal precision. The column will adopt the Decimal column type.
%d
This will display an integer. The column will adopt the Integer type.
%c
This will display a character. The column will adopt the Character type. NOTE: to display the character, the value of the character must be within the displayable ASCII range.
%s
This will display the data using the .toString() method. The column will adopt the String type.
When a format string is not specified, the column type will be Unknown. The data will be displayed using the Object's .toString() method.
For more format string examples, see Java's documentation on String.format().
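As a rough illustration of the mapping above, the following sketch derives a column type from the last character of a format string and renders a value with String.format(). The class ColumnTypeSketch, its enum, and the rule of looking only at the final conversion character are assumptions for this example; the actual GATK implementation may decide the type differently.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the GATK codebase.
final class ColumnTypeSketch {
    enum ColumnType { DECIMAL, INTEGER, CHARACTER, STRING, UNKNOWN }

    // Assumption: the conversion character at the end of the format string
    // is enough to pick the column type.
    static ColumnType fromFormat(String format) {
        if (format == null || format.isEmpty()) {
            return ColumnType.UNKNOWN;
        }
        switch (format.charAt(format.length() - 1)) {
            case 'f': case 'e': case 'E': return ColumnType.DECIMAL;
            case 'd': return ColumnType.INTEGER;
            case 'c': return ColumnType.CHARACTER;
            case 's': return ColumnType.STRING;
            default:  return ColumnType.UNKNOWN;
        }
    }

    // Untyped columns fall back to toString(); typed ones use String.format().
    static String render(String format, Object value) {
        if (fromFormat(format) == ColumnType.UNKNOWN) {
            return String.valueOf(value);
        }
        return String.format(Locale.ROOT, format, value);
    }

    public static void main(String[] args) {
        System.out.println(fromFormat("%.8f"));         // DECIMAL
        System.out.println(render("%.4e", 0.004353));   // 4.3530e-03
        System.out.println(render("%d", 34));           // 34
        System.out.println(render(null, true));         // true
    }
}
```

Passing a value that does not match the conversion (for example a String to "%d") makes String.format() throw, which mirrors the rule that data in a typed column must match the column type.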
Definitions
Report header
The first line, structured as:
#:GATKReport.<version>:<number of tables>
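For readers consuming a report outside R, the header line can be split into its two fields with a few lines of code. This is a minimal sketch: ReportHeaderSketch and its parse method are hypothetical names, not part of GATK, and the only thing assumed about the format is the structure shown above.

```java
// Hypothetical illustration only; not part of the GATK codebase.
final class ReportHeaderSketch {
    final String version;
    final int tableCount;

    private ReportHeaderSketch(String version, int tableCount) {
        this.version = version;
        this.tableCount = tableCount;
    }

    static ReportHeaderSketch parse(String line) {
        final String prefix = "#:GATKReport.";
        if (!line.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a GATKReport header: " + line);
        }
        // The version itself contains a dot (e.g. "v1.0"), so the table count
        // is taken from whatever follows the last colon.
        final int lastColon = line.lastIndexOf(':');
        if (lastColon < prefix.length()) {
            throw new IllegalArgumentException("Missing table count: " + line);
        }
        final String version = line.substring(prefix.length(), lastColon);
        final int tableCount = Integer.parseInt(line.substring(lastColon + 1).trim());
        return new ReportHeaderSketch(version, tableCount);
    }

    public static void main(String[] args) {
        ReportHeaderSketch header = parse("#:GATKReport.v1.0:2");
        System.out.println(header.version + " with " + header.tableCount + " table(s)");
        // prints: v1.0 with 2 table(s)
    }
}
```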
Table header
The first two lines of every table. The first holds the table's metadata, including the format string of each column; the second holds the table's name and description. Each column in the table has a unique name.
The first column of the table is the "primary key" column, which provides the unique identifier for each row. Once this column is created, any element in the table can be referenced by its row-column coordinate, i.e. the "primary key"-"column name" coordinate.
When a column is added to a table, a default value must be specified (usually 0). This is the initial value for an element in a column. This permits operations like increment() and decrement() to work properly on columns that are effectively counters for a particular event.
Finally, the display property for each column can be set during column creation. This is useful when a given column stores an intermediate result that will be used later on, perhaps to calculate the value of another column. In these cases, it's obviously necessary to store the value required for further computation, but it's not necessary to actually print the intermediate column.
Column header
The next row of the table, containing the primary key name (if displayed) and the column names.
Table body
The values of the table itself.
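Because the column header and the table body are plain whitespace-separated text, they can be read back with very little code. The sketch below pairs each value with its column name. TableBodySketch and readRows are hypothetical names, and splitting on runs of whitespace is an assumption that only holds when no value is empty or contains spaces (a simple report's empty-string default would break it).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the GATK codebase.
final class TableBodySketch {
    // Turns a column header line plus body lines into one ordered map per row.
    static List<Map<String, String>> readRows(String columnHeader, List<String> bodyLines) {
        String[] names = columnHeader.trim().split("\\s+");
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : bodyLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank separator lines
            }
            String[] values = line.trim().split("\\s+");
            if (values.length != names.length) {
                throw new IllegalArgumentException("Expected " + names.length + " values: " + line);
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < names.length; i++) {
                row.put(names[i], values[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("1:1000 T", "1:1001 A", "1:1002 C");
        for (Map<String, String> row : readRows("key column", body)) {
            System.out.println(row);
        }
    }
}
```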
Technical Details
Implementation
The implementation of this table has two components:
- A TreeSet<Object> that stores all the values ever specified for the primary key. Any get() operation that refers to an element where the primary key object does not exist will result in its implicit creation. I haven't yet decided if this is a good idea...
- A HashMap<String, GATKReportColumn> that stores a mapping from column name to column contents. Each GATKReportColumn is effectively a map (in fact, GATKReportColumn extends TreeMap<Object, Object>) between primary key and the column value. This means that, given N columns, the primary key information is stored N+1 times. This is obviously wasteful and can likely be handled much more elegantly in future implementations.
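The two components above can be sketched in a few lines of standalone Java. This is only an approximation of the layout described here: ReportTableSketch, its method names, and the separate map of default values are assumptions for illustration, and the real GATKReportTable and GATKReportColumn classes have a larger API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the GATK codebase.
final class ReportTableSketch {
    // Every primary key ever seen, kept sorted. Keys must be mutually
    // comparable (e.g. all Strings), otherwise TreeSet throws ClassCastException.
    private final TreeSet<Object> primaryKeys = new TreeSet<>();
    // Column name -> (primary key -> value), so keys are stored once per column as well.
    private final Map<String, TreeMap<Object, Object>> columns = new HashMap<>();
    private final Map<String, Object> defaults = new HashMap<>();

    void addColumn(String name, Object defaultValue) {
        columns.put(name, new TreeMap<>());
        defaults.put(name, defaultValue);
    }

    void set(Object key, String column, Object value) {
        primaryKeys.add(key);
        column(column).put(key, value);
    }

    Object get(Object key, String column) {
        primaryKeys.add(key); // a get() on an unknown key implicitly creates the row
        TreeMap<Object, Object> values = column(column);
        return values.containsKey(key) ? values.get(key) : defaults.get(column);
    }

    SortedSet<Object> keys() {
        return Collections.unmodifiableSortedSet(primaryKeys);
    }

    private TreeMap<Object, Object> column(String name) {
        TreeMap<Object, Object> values = columns.get(name);
        if (values == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values;
    }

    public static void main(String[] args) {
        ReportTableSketch table = new ReportTableSketch();
        table.addColumn("SomeInt", 0);
        table.set("Bob", "SomeInt", 34);
        System.out.println(table.get("Tim", "SomeInt")); // prints 0 (the default)
        System.out.println(table.keys());                // prints [Bob, Tim]
    }
}
```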
Element and column operations
In addition to simply getting and setting values, this object also permits some simple operations to be applied to individual elements or to whole columns. For instance, an element can be easily incremented without the hassle of calling get(), incrementing the obtained value by 1, and then calling set() with the new value. Also, some vector operations are supported. For instance, two whole columns can be divided and have the result be set to a third column. This is especially useful when aggregating counts in two intermediate columns that will eventually need to be manipulated row-by-row to compute the final column.
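The element and column operations described above might look roughly like the following. This is a sketch under stated assumptions rather than the GATK API: ColumnOpsSketch, increment, and divide are hypothetical names, and each column is modelled as a plain map from primary key to a double value.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the GATK codebase.
final class ColumnOpsSketch {
    // Adds 1 to the element at `key`, starting from `defaultValue` if it was never set.
    static void increment(Map<String, Double> column, String key, double defaultValue) {
        column.merge(key, defaultValue + 1.0, (current, unused) -> current + 1.0);
    }

    // Divides two columns row by row and returns the result as a new column.
    // Keys missing from the denominator are skipped; a zero denominator follows
    // normal floating-point rules and yields Infinity or NaN.
    static Map<String, Double> divide(Map<String, Double> numerator, Map<String, Double> denominator) {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Double> entry : numerator.entrySet()) {
            Double divisor = denominator.get(entry.getKey());
            if (divisor != null) {
                result.put(entry.getKey(), entry.getValue() / divisor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> errors = new TreeMap<>();
        Map<String, Double> bases = new TreeMap<>();
        increment(errors, "cycle0", 0.0);
        for (int i = 0; i < 4; i++) {
            increment(bases, "cycle0", 0.0);
        }
        System.out.println(divide(errors, bases)); // prints {cycle0=0.25}
    }
}
```

Starting each counter from a default value is what lets increment work on an element that was never explicitly set, which is the reason given above for requiring a default when a column is added.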
GSAlib
The gsalib R library offers a facility to load GATKReport files. To use this function, you must have a checkout of the Sting codebase. Then, follow these steps:
1. Compile the gsalib library:
$ ant gsalib
Buildfile: build.xml
gsalib:
[exec] * installing *source* package 'gsalib' ...
[exec] ** R
[exec] ** data
[exec] ** preparing package for lazy loading
[exec] ** help
[exec] *** installing help indices
[exec] ** building package indices ...
[exec] ** testing if installed package can be loaded
[exec]
[exec] * DONE (gsalib)
BUILD SUCCESSFUL
2. Tell R where to find the gsalib library by adding the path in your ~/.Rprofile (you may need to create this file if it doesn't exist):
$ cat .Rprofile
.libPaths("/path/to/Sting/R/")
3. Start R and load the gsalib library:
$ R
R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(gsalib)
4. Finally, load the GATKReport file:
> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list