GATKReport


The GATKReport output file is a single text file containing multiple tables, formatted to be both human-readable and computationally accessible in languages like R.

GATK Report

GATKReport is a data structure that collects data over the course of a walker's computation and then writes that data to a PrintStream in a form that is human-readable, AWK-able, and R-friendly (provided you load it using the GATKReport loader module).

The goal of this object is to use the same data structure for both accumulating data during a walker's computation and emitting that data to a file for easy analysis in R (or any other program/language that can take in a table of results). Thus, all of the infrastructure below is designed simply to make printing the following as easy as possible.

Below is an example table:

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7                                         
0      7.451835696110506E-3      25.474613284804366                                      
1      2.362777171937477E-3      29.844949954504095                                      
2      9.087604507451836E-4      32.875909752547310
3      5.452562704471102E-4      34.498999090081895                                      
4      9.087604507451836E-4      35.148316651501370                                       
5      5.452562704471102E-4      36.072234352256190                                       
6      5.452562704471102E-4      36.121724890829700                                        
7      5.452562704471102E-4      36.191048034934500                                        
8      5.452562704471102E-4      36.003457059679770                                       
   
#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T 
1:1001  A 
1:1002  C 

Here, we have a GATKReport - a well-formatted, easy to read representation of some tabular data. It begins with the report header, which gives the version and the number of tables. This report contains two individual GATK report tables. Every table begins with a header line for its metadata, followed by a header line for its name and description. The next row contains the column names, and the rows after it contain the data. Any object is accepted as data unless the column is typed; in that case, the data must match the column type.
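The layout described above is regular enough to read back with a few lines of code. The following is a minimal sketch, not the GATK or gsalib loader: ReportTextSketch and readTables are hypothetical names, and the code assumes that each table has exactly two "#:GATKTable:" header lines (metadata first, then name and description) and that no value contains whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the GATK codebase.
class ReportTextSketch {
    private static final String TABLE_PREFIX = "#:GATKTable:";

    /** Maps each table name to its rows; row 0 of every table holds the column names. */
    static Map<String, List<String[]>> readTables(String reportText) {
        Map<String, List<String[]>> tables = new LinkedHashMap<>();
        List<String[]> current = null;
        boolean sawMetadataLine = false;
        for (String line : reportText.split("\\R")) {
            if (line.startsWith(TABLE_PREFIX)) {
                if (!sawMetadataLine) {
                    // First header line of a table: metadata, ignored in this sketch.
                    sawMetadataLine = true;
                } else {
                    // Second header line: #:GATKTable:<name>:<description>
                    String name = line.substring(TABLE_PREFIX.length()).split(":", 2)[0];
                    current = new ArrayList<>();
                    tables.put(name, current);
                    sawMetadataLine = false;
                }
            } else if (current != null && !line.trim().isEmpty() && !line.startsWith("#")) {
                // Column-name row or data row: split on runs of whitespace, as AWK would.
                current.add(line.trim().split("\\s+"));
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        String report = String.join("\n",
                "#:GATKReport.v1.0:1",
                "#:GATKTable:false:2:3:%s:%c:;",
                "#:GATKTable:TableName:Description",
                "key    column",
                "1:1000  T",
                "1:1001  A",
                "1:1002  C");
        List<String[]> rows = readTables(report).get("TableName");
        System.out.println((rows.size() - 1) + " data rows");
    }
}
```

Splitting on whitespace keeps the sketch short, but it would mis-parse any value that itself contains spaces.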

Usages

Simple GATK Report

The simple GATK report is an easy way to collect data and output it to a file.

A simple GATK Report consists of the following:

  • A single table
  • No primary key (it is hidden)
  • Optional:
    • Only untyped columns. As long as the data is an Object, it will be accepted.
    • Default column values being empty strings.

Limitations

  • A simple GATK report cannot contain multiple tables.
  • It cannot contain typed columns, which prevents arithmetic gathering (a feature not yet implemented).

Example

The following code creates a simple GATK Report, fills it with data, and prints the output to the console. In practice, the report can and should be printed to a PrintStream backed by a file (for example, one wrapping a FileOutputStream).

// Create a new simple GATK report named "TableName" with columns: Roger, is, and Awesome
GATKReport report = GATKReport.newSimpleReport("TableName", "Roger", "is", "Awesome");

// Add data to simple GATK report
report.addRow( 12, 23.45, true);
report.addRow("string", 'A', 24.5D);
report.addRow("NaN", "", 2342000L);

// Print the report to console
report.print(System.out);

Below is an example of a walker using the simple GATK Report.

@ActiveRegionExtension(extension=50)
public class CountReadsInActiveRegions extends ActiveRegionWalker<CountReadsInActiveRegions.Datum, GATKReport> {
    @Output
    @Gather(GATKReportGatherer.class)
    PrintStream out;

    public static class Datum {
        private final GenomeLoc activeRegionLoc;
        private final GenomeLoc extendedLoc;
        public final boolean isActive;
        public int nReads;

        public Datum(final GenomeLoc activeRegionLoc, final GenomeLoc extendedLoc, final boolean active, final int nReads) {
            this.activeRegionLoc = activeRegionLoc;
            this.extendedLoc = extendedLoc;
            isActive = active;
            this.nReads = nReads;
        }
    }

    boolean coinFlip = false;

    @Override
    public double isActive( final RefMetaDataTracker tracker, final ReferenceContext ref, final AlignmentContext context ) {
        if( GenomeAnalysisEngine.getRandomGenerator().nextDouble() > 0.9995 ) {
            coinFlip = !coinFlip;
        }
        return ( coinFlip ? 0.9995 : 0.0 );
    }

    @Override
    public Datum map( final ActiveRegion activeRegion, final RefMetaDataTracker tracker ) {
        return new Datum(activeRegion.getLocation(), activeRegion.getExtendedLoc(), activeRegion.isActive, activeRegion.size());
    }

    @Override
    public GATKReport reduceInit() {
        return GATKReport.newSimpleReport("CountReadsInActiveRegions", "loc", "extended.loc", "is.active", "n.reads");
    }

    @Override
    public GATKReport reduce( final Datum value, final GATKReport report ) {
        report.addRow(value.activeRegionLoc.toString(), value.extendedLoc.toString(), value.isActive, value.nReads);
        return report;
    }

    @Override
    public void onTraversalDone(final GATKReport report) {
        report.print(out);
    }
}

This code produces the following table:

#:GATKReport.v1.0:1
#:GATKTable:false:4:7:::::;
#:GATKTable:CountReadsInActiveRegions:A simplified GATK table report
loc                   extended.loc          is.active  n.reads
20:10017935-10018360  20:10017885-10018410  false          329
20:10018361-10018773  20:10018311-10018823  false          307
20:10077202-10077627  20:10077152-10077677  false          316
20:10077628-10077801  20:10077578-10077851  false          127
20:10077802-10078042  20:10077752-10078092  true           145
20:10096866-10097291  20:10096816-10097341  true           323
20:10097292-10097701  20:10097242-10097751  true           261

Full GATK Report

The full GATK Report offers much more functionality but is more complex to set up.

Examples

Here, we create a GATK report with two tables. Each column is created with its own default value and, for typed columns, a format string.

// Create a new GATK report
GATKReport report1 = new GATKReport();

// Add a table with specified name and description
report1.addTable("TableName", "To contain some more data types");

// Retrieve the newly created table
GATKReportTable table = report1.getTable("TableName");

// Add a primary key that will be shown
table.addPrimaryKey("key", true);
// Each column here is typed and created with name, default value,
// a boolean for whether or not it will be displayed, and a format string 
table.addColumn("SomeInt", 0, true, "%d");
table.addColumn("SomeFloat", 0.0, true, "%.16E");

// Fill in the data. Note that when the value for a certain key and column is not specified,
// the default value is used.
table.set("Bob", "SomeInt", 34);
table.set("Bob", "SomeFloat", 34.0);
table.set("Tim", "SomeInt", -1);
table.set("Rob", "SomeFloat", 0.000003);
table.set("Rob", "SomeInt", 99);
table.set("Roger", "SomeFloat", 1234.5);

// Create a second table
report1.addTable("Table2", "Description");

// Create a primary key that will be hidden
report1.getTable("Table2").addPrimaryKey("cycle", false);
// Create a typed column of type Decimal
report1.getTable("Table2").addColumn("Error Rate", 0.0, true, "%.4e");
// Create an untyped column named "Column" with "empty" as its default value
report1.getTable("Table2").addColumn("Column", "empty" );

// Fill in the data
report1.getTable("Table2").set(0, "Error Rate", 0.004353);
report1.getTable("Table2").set(1, "Error Rate", 0.013452);
report1.getTable("Table2").set(2, "Error Rate", 0.0);
report1.getTable("Table2").set(3, "Error Rate", 0.999);
// Print the report to console
report1.print(System.out);

GATK Report Gatherer

The GATK report now comes with functionality to gather reports for scatter-gather jobs (see Queue). This allows jobs to run in parallel, with their data combined at the end. Using the GATK Report Gatherer is very easy. The gatherer is currently limited to combining every row into one big report. More advanced gathering techniques can be added later by popular demand.

Examples

In your walker, you simply need to add the @Gather annotation to your output report, as shown below.

@Output
@Gather(GATKReportGatherer.class)
PrintStream out;

For an example walker that uses the gatherer, see the walker in the simple GATK report example. The walker generates Scala scripts that allow Queue to scatter-gather those jobs. Consult Queue's documentation for how to use Scala scripts to scatter-gather your jobs.
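The row-combining behaviour described in this section can be pictured with a small standalone sketch. This is not the GATKReportGatherer implementation: RowGatherSketch and gatherRows are hypothetical names, and each scattered piece is assumed to be a list of rows whose first row holds the column names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "combine every row into one big report".
class RowGatherSketch {
    /** Concatenates the data rows of all pieces, in order, under a single column-name row. */
    static List<String[]> gatherRows(List<List<String[]>> pieces) {
        List<String[]> combined = new ArrayList<>();
        String[] header = null;
        for (List<String[]> piece : pieces) {
            if (piece.isEmpty()) {
                continue; // nothing to contribute
            }
            if (header == null) {
                header = piece.get(0);
                combined.add(header);
            } else if (!Arrays.equals(header, piece.get(0))) {
                throw new IllegalArgumentException("column names differ between pieces");
            }
            combined.addAll(piece.subList(1, piece.size()));
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String[]> first = Arrays.asList(
                new String[] {"loc", "n.reads"},
                new String[] {"20:10017935-10018360", "329"});
        List<String[]> second = Arrays.asList(
                new String[] {"loc", "n.reads"},
                new String[] {"20:10096866-10097291", "323"});
        for (String[] row : gatherRows(Arrays.asList(first, second))) {
            System.out.println(String.join("  ", row));
        }
    }
}
```

Because rows are only concatenated, no arithmetic is performed across pieces; that is the part the text describes as future work for typed columns.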

Format String

In the GATK report, every column can have a format specified by a format string. This string is applied to the data using the String.format() function. The format string dictates the column type. A format string is not required, but when one is used, the column has a column type, which will enable arithmetic gathering in the future.

Examples

%.8f

This will display a numeric object with 8 digit decimal precision. The column will adopt the Decimal column type.

%d

This will display an integer. The column will adopt the Integer type.

%c

This will display a character. The column will adopt the Character type. NOTE: to display the character, the value of the character must be within the displayable ASCII range.

%s

This will display the data using the .toString() method. The column will adopt the String type.

When a format string is not specified, the column type will be Unknown. The data will be displayed using the Object's .toString() method.

For more format string examples, see Java's documentation on String.format().
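The format strings above can be tried out directly with String.format(). The sketch below uses only the JDK; FormatStringSketch and render are hypothetical names, and the fixed Locale.ROOT is an assumption made here so that the decimal separator is always a period.

```java
import java.util.Locale;

// Hypothetical helper showing how a column's format string affects the displayed value.
class FormatStringSketch {
    /** Renders a value with the given format string; a null or empty format means an untyped column. */
    static String render(String format, Object value) {
        if (format == null || format.isEmpty()) {
            return String.valueOf(value); // untyped column: fall back to toString()
        }
        return String.format(Locale.ROOT, format, value);
    }

    public static void main(String[] args) {
        System.out.println(render("%.8f", 0.5)); // 0.50000000
        System.out.println(render("%d", 34));    // 34
        System.out.println(render("%c", 'T'));   // T
        System.out.println(render("%s", true));  // true
        System.out.println(render("", 24.5));    // 24.5
    }
}
```

Passing a value that does not match the conversion (for example a String to %d) makes String.format() throw an exception, which mirrors the rule that data in a typed column must match the column type.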

Definitions

Report header

The first line, structured as:

#:GATKReport.<version>:<number of tables>
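As a rough illustration of this structure, the report header line can be split into its version and table count. ReportHeaderSketch is a hypothetical class written for this page, not part of the GATK codebase, and it assumes the version string itself contains no colon.

```java
// Hypothetical parser for the "#:GATKReport.<version>:<number of tables>" line.
class ReportHeaderSketch {
    private static final String PREFIX = "#:GATKReport.";

    final String version;
    final int tableCount;

    private ReportHeaderSketch(String version, int tableCount) {
        this.version = version;
        this.tableCount = tableCount;
    }

    static ReportHeaderSketch parse(String line) {
        if (!line.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a GATKReport header: " + line);
        }
        // The table count follows the last colon; the version sits between the prefix and that colon.
        int separator = line.lastIndexOf(':');
        if (separator < PREFIX.length()) {
            throw new IllegalArgumentException("missing table count: " + line);
        }
        String version = line.substring(PREFIX.length(), separator);
        int tableCount = Integer.parseInt(line.substring(separator + 1).trim());
        return new ReportHeaderSketch(version, tableCount);
    }

    public static void main(String[] args) {
        ReportHeaderSketch header = parse("#:GATKReport.v1.0:2");
        System.out.println(header.version + " with " + header.tableCount + " tables");
    }
}
```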

Table header

The first two lines of every table: a line containing the table's metadata, followed by a line containing the table's name and description. Each column in a table has a unique name.

The first column mentioned in the table header is the "primary key" column - a column that provides the unique identifier for each row in the table. Once this column is created, any element in the table can be referenced by the row-column coordinate, i.e. "primary key"-"column name" coordinate.

When a column is added to a table, a default value must be specified (usually 0). This is the initial value for an element in a column. This permits operations like increment() and decrement() to work properly on columns that are effectively counters for a particular event.

Finally, the display property for each column can be set during column creation. This is useful when a given column stores an intermediate result that will be used later on, perhaps to calculate the value of another column. In these cases, it's obviously necessary to store the value required for further computation, but it's not necessary to actually print the intermediate column.

Column header

The next row of the table, containing the primary key name (if displayed) and the column names.

Table body

The values of the table itself.

Technical Details

Implementation

The implementation of this table has two components:

  1. A TreeSet<Object> that stores all the values ever specified for the primary key. Any get() operation that refers to an element where the primary key object does not exist will result in its implicit creation. I haven't yet decided if this is a good idea...
  2. A HashMap<String, GATKReportColumn> that stores a mapping from column name to column contents. Each GATKReportColumn is effectively a map (in fact, GATKReportColumn extends TreeMap<Object, Object>) between primary key and the column value. This means that, given N columns, the primary key information is stored N+1 times. This is obviously wasteful and can likely be handled much more elegantly in future implementations.
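The two components listed above can be sketched with plain JDK collections. This is an illustration only, not the GATKReportTable source: TableSketch and its methods are hypothetical, and keeping the per-column default values in a separate map is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical miniature of the described layout: a key set plus one sorted map per column.
class TableSketch {
    // Component 1: every primary key value ever seen (keys must be mutually comparable).
    private final TreeSet<Object> primaryKeys = new TreeSet<>();
    // Component 2: column name -> (primary key -> value).
    private final Map<String, TreeMap<Object, Object>> columns = new HashMap<>();
    // Assumption of this sketch: defaults are tracked alongside the columns.
    private final Map<String, Object> defaults = new HashMap<>();

    void addColumn(String name, Object defaultValue) {
        columns.put(name, new TreeMap<>());
        defaults.put(name, defaultValue);
    }

    void set(Object key, String column, Object value) {
        primaryKeys.add(key);
        columnOf(column).put(key, value);
    }

    Object get(Object key, String column) {
        primaryKeys.add(key); // referring to an unknown key implicitly creates the row
        TreeMap<Object, Object> values = columnOf(column);
        return values.containsKey(key) ? values.get(key) : defaults.get(column);
    }

    int rowCount() {
        return primaryKeys.size();
    }

    private TreeMap<Object, Object> columnOf(String name) {
        TreeMap<Object, Object> column = columns.get(name);
        if (column == null) {
            throw new IllegalArgumentException("unknown column: " + name);
        }
        return column;
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        table.addColumn("SomeInt", 0);
        table.set("Bob", "SomeInt", 34);
        System.out.println(table.get("Bob", "SomeInt")); // 34
        System.out.println(table.get("Tim", "SomeInt")); // 0 (default value)
        System.out.println(table.rowCount());            // 2
    }
}
```

Each key appears once in the key set and again in every column map that holds a value for it, which is the duplication the text calls wasteful.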

Element and column operations

In addition to simply getting and setting values, this object also permits some simple operations to be applied to individual elements or to whole columns. For instance, an element can be easily incremented without the hassle of calling get(), incrementing the obtained value by 1, and then calling set() with the new value. Also, some vector operations are supported. For instance, two whole columns can be divided and have the result be set to a third column. This is especially useful when aggregating counts in two intermediate columns that will eventually need to be manipulated row-by-row to compute the final column.
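The increment and column-division operations described here can be sketched as follows. ColumnOpsSketch and its method names are hypothetical stand-ins rather than the GATK API; the sketch assumes numeric columns keyed by strings, with 0 as the default value of every column.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of element-wise and column-wise operations on counter columns.
class ColumnOpsSketch {
    private final Map<String, TreeMap<String, Double>> columns = new HashMap<>();

    void addColumn(String name) {
        columns.put(name, new TreeMap<>());
    }

    double get(String key, String column) {
        return columns.get(column).getOrDefault(key, 0.0);
    }

    /** Adds 1 to an element without a separate get() and set() by the caller. */
    void increment(String key, String column) {
        columns.get(column).merge(key, 1.0, Double::sum);
    }

    /** Sets target[row] = numerator[row] / denominator[row] for every row present in the denominator. */
    void divideColumns(String target, String numerator, String denominator) {
        TreeMap<String, Double> result = columns.get(target);
        for (String key : columns.get(denominator).keySet()) {
            result.put(key, get(key, numerator) / get(key, denominator));
        }
    }

    public static void main(String[] args) {
        ColumnOpsSketch table = new ColumnOpsSketch();
        table.addColumn("mismatches");
        table.addColumn("bases");
        table.addColumn("errorrate");
        table.increment("0", "mismatches");
        for (int i = 0; i < 4; i++) {
            table.increment("0", "bases");
        }
        table.divideColumns("errorrate", "mismatches", "bases");
        System.out.println(table.get("0", "errorrate")); // 0.25
    }
}
```

The two counter columns here play the role of intermediate columns; in a real report they could be hidden through the display property so that only the computed column is printed.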

GSAlib

The gsalib R library offers a facility to load GATKReport files. To use this function, you must have a checkout of the Sting codebase. Then, follow these steps:

1. Compile the gsalib library:

$ ant gsalib
Buildfile: build.xml

gsalib:
     [exec] * installing *source* package 'gsalib' ...
     [exec] ** R
     [exec] ** data
     [exec] ** preparing package for lazy loading
     [exec] ** help
     [exec] *** installing help indices
     [exec] ** building package indices ...
     [exec] ** testing if installed package can be loaded
     [exec] 
     [exec] * DONE (gsalib)

BUILD SUCCESSFUL

2. Tell R where to find the gsalib library by adding the path in your ~/.Rprofile (you may need to create this file if it doesn't exist):

$ cat .Rprofile 
.libPaths("/path/to/Sting/R/")

3. Start R and load the gsalib library:

$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(gsalib)

4. Finally, load the GATKReport file:

> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list