Tagged with #coding-standards
2 documentation articles | 0 events or announcements | 0 forum discussions


Introduction

This document describes the workflow we use within GSA to do coverage analysis of the GATK codebase. It is primarily meant as an internal reference for team members, but are making it public to provide an example of how we work. There are a few mentions of internal server names etc.; please just disregard those as they will not be applicable to you.

Build the GATK, and run tests with clover

ant clean with.clover unittest

Note that you have to explicitly disable scala (due to a limitation in how it's currently integrated in build.xml). Note you can use things like -Dsingle="ReducerUnitTest" as well.

It seems that clover requires a lot of memory, so a few things are necessary:

setenv ANT_OPTS "-Xmx8g"

There's plenty of memory on gsa4, so it's not a problem to require so much memory

Getting more detailed reports

You can add the argument -Dclover.instrument.level=statement if you want line-level resolution on the report, but note this is astronomically expensive for the entire unit test suite. It's fine though if you want to run specific run tests.

Generate the report

> ant clover.report
Buildfile: /Users/depristo/Desktop/broadLocal/GATK/unstable/build.xml

clover.report:
[clover-html-report] Clover Version 3.1.8, built on November 13 2012 (build-876)
[clover-html-report] Loaded from: /Users/depristo/Desktop/broadLocal/GATK/unstable/private/resources/clover/lib/clover.jar
[clover-html-report] Clover: Community License registered to Broad Institute.
[clover-html-report] Loading coverage database from: '/Users/depristo/Desktop/broadLocal/GATK/unstable/.clover/clover3_1_8.db'
[clover-html-report] Writing HTML report to '/Users/depristo/Desktop/broadLocal/GATK/unstable/clover_html'
[clover-html-report] Done. Processed 132 packages in 20943ms (158ms per package).
    [mkdir] Created dir: /Users/depristo/private_html/report/clover
     [copy] Copying 4545 files to /Users/depristo/private_html/report/clover

BUILD SUCCESSFUL

The clover files are present in a subdirectory clover_html as well as copied to your private_html/report directory. Note this can be very expensive given our large number of tests. For example, I've been waiting for the report to generate for nearly an hour on gsa4.

Doing it all at once

ant clean with.clover unittest clover.report

will clean the source, rebuild with clover engaged, run the unit tests, and generate the clover report. Note that currently unittests may be failing due to classcast and other exceptions in the clover run. We're looking into it. But you can still run clover.report after the failed run, as the db contains all of the run information, even through it failed (though failed methods won't be counted).

Here's a real-life example of assessing coverage in all BQSR utilities at once:

ant clean with.clover unittest -Dclover.instrument.level=statement -Dsingle="recalibration/*UnitTest" clover.report

Current annoyance

Clover can make the tests very slow. Currently we are run in method count only mode (we don't have line number resolution (looking into fixing this). Also note that running with clover over the entire unittest set requires 32G of RAM (set automatically by ant).

This produces an HTML report that looks like the following screenshots

Using clover to make better unittests

This workflow is appropriate for developing unit tests for a single package or class. The turn-around time for clover on a single package is very fast, even with statement-level coverage. The overall workflow looks like:

  1. run unittests with clover enabled for your package or class.
  2. explore clover HTML report, noting places where test coverage is lacking
  3. expand unit tests
  4. repeat until satisfied

Here's a concrete example. Right now I'm looking at the unit test coverage for GenomeLoc, one of the earliest and most important classes in the GATK. I really want good unit test coverage here. So I start by running GenomeLoc unit tests specifically:

ant clean with.clover unittest -Dsingle="GenomeLocUnitTest" -Dclover.instrument.level=statement clover.report

Next, I open up the clover coverage report in clover_html/index.html in my GATK directory, and landing on the Dashboard. Everything looks pretty bad, but that's because I only ran the GenomeLoc tests, and it displays the entire project coverage. I click on the "Coverage" link in the upper-left frame, and scroll down to the package where GenomeLoc lives (org.broadinstitute.sting.utils). At the bottom of this page I find my two classes, GenomeLoc and GenomeLocParser.CachingSequenceDictionary:

These have ~50% statement-level coverage each. Not ideal, really.

Let's dive into GenomeLoc itself a bit more. Clicking on the GenomeLoc link brings up to the code coverage page. Here you can see a few things very quickly.

  • Some of the methods are greyed out. This is because they are considered by our clover report as trivial getter/setter methods, and shouldn't be counted.
  • Some methods have reasonably good test coverage, such as disjointP with thousands of tests.
  • Some methods have some tests, but a very limited number, such as contiguousP which only has 2 tests. Now maybe that's enough, but it's worth thinking about whether 2 tests would really cover all of the test cases for this method.
  • Some methods (such as intersect) have good coverage on some branches but no coverage on what looks like an important branch (the unmapped handling code).
  • Some methods just don't have any tests at all (subtract), which is very dangerous if this method is an important one used throughout the GATK.

For methods with poor test coverage (branches or overall) I'd look into their uses, and try to answer a few questions:

  • How widely used this is function? Is this method used at all? Perhaps it's just unused code that can be deleted. Perhaps its only used in one specific class, and it's not worth my time testing it (a dangerous statement, as basically any untested code can assumed to be broken now, or some point in the future). If it's widely used, I should design some unit tests for it.
  • Are the uses simpler than the full code itself? Perhaps a simpler function can be extracted, and it tested.

If the code needs tests, I would design specific unit tests (or data providers that cover all possible cases) for these function. Once that newly-written code is in place, I would rerun the ant tasks above to get updated coverage information, and continue until I'm satisfied.

Introduction

This document describes the current GATK coding standards for documentation and unit testing. The overall goal is that all functions be well documented, have unit tests, and conform to the coding conventions described in this guideline. It is primarily meant as an internal reference for team members, but we are making it public to provide an example of how we work. There are a few mentions of specific team member responsibilities and who to contact with questions; please just disregard those as they will not be applicable to you.

Coding conventions

General conventions

The Genome Analysis Toolkit generally follows Java coding standards and good practices, which can be viewed at Sun's site.

The original coding standard document for the GATK was developed in early 2009. It remains a reasonable starting point but may be superseded by statements on this page (available as a PDF).

Size of functions and functional programming style

Code in the GATK should be structured into clear, simple, and testable functions. Clear means that the function takes a limited number of arguments, most of which are values not modified, and in general should return newly allocated results, as opposed to directly modifying the input arguments (functional style). The max. size of functions should be approximately one screen's worth of real estate (no more than 80 lines), including inline comments. If you are writing functions that are much larger than this, you must refactor your code into modular components.

Code duplication

Do not duplicate code. If you are finding yourself wanting to make a copy of functionality, refactor the code you want to duplicate and enhance it. Duplicating code introduces bugs, makes the system harder to maintain, and will require more work since you will have a new function that must be tested, as opposed to expanding the tests on the existing functionality.

Documentation

Functions must be documented following the javadoc conventions. That means that the first line of the comment should be a simple statement of the purpose of the function. Following that is an expanded description of the function, such as edge case conditions, requirements on the argument, state changes, etc. Finally comes the @param and @return fields, that should describe the meaning of each function argument, restrictions on the values allowed or returned. In general, the return field should be about types and ranges of those values, not the meaning of the result, as this should be in the body of the documentation.

Testing for valid inputs and contracts

The GATK uses Contracts for Java to help us enforce code quality during testing. See CoFoJa for more information. If you've never programmed with contracts, read their excellent description Adding contracts to a stack. Contracts are only enabled when we are testing the code (unittests and integration tests) and not during normal execution, so contracts can be reasonably expensive to compute. They are best used to enforce assumptions about the status of class variables and return results.

Contracts are tricky when it comes to input arguments. The best practice is simple:

  • Public functions with arguments should explicitly test those input arguments for good values with live java code (such as in the example below). Because the function is public, you don't know what the caller will be passing in, so you have to check and ensure quality.
  • Private functions with arguments should use contracts instead. Because the function is private, the author of the code controls use of the function, and the contracts enforce good use. But in principal the quality of the inputs should be assumed at runtime since only the author controlled calls to the function and input QC should have happened elsewhere

Below is an example private function that makes good use of input argument contracts:

/**
 * Helper function to write out a IGV formatted line to out, at loc, with values
 *
 * http://www.broadinstitute.org/software/igv/IGV
 *
 * @param out a non-null PrintStream where we'll write our line
 * @param loc the location of values
 * @param featureName string name of this feature (see IGV format)
 * @param values the floating point values to associate with loc and feature name in out
 */
@Requires({
        "out != null",
        "loc != null",
        "values.length > 0"
})
private void printIGVFormatRow(final PrintStream out, final GenomeLoc loc, final String featureName, final double ... values) {
    // note that start and stop are 0 based, but the stop is exclusive so we don't subtract 1
    out.printf("%s\t%d\t%d\t%s", loc.getContig(), loc.getStart() - 1, loc.getStop(), featureName);
    for ( final double value : values )
        out.print(String.format("\t%.3f", value));
    out.println();
} 

Final variables

Final java fields cannot be reassigned once set. Nearly all variables you write should be final, unless they are obviously accumulator results or other things you actually want to modify. Nearly all of your function arguments should be final. Being final stops incorrect reassigns (a major bug source) as well as more clearly captures the flow of information through the code.

An example high-quality GATK function

/**
 * Get the reference bases from referenceReader spanned by the extended location of this active region,
 * including additional padding bp on either side.  If this expanded region would exceed the boundaries
 * of the active region's contig, the returned result will be truncated to only include on-genome reference
 * bases
 * @param referenceReader the source of the reference genome bases
 * @param padding the padding, in BP, we want to add to either side of this active region extended region
 * @param genomeLoc a non-null genome loc indicating the base span of the bp we'd like to get the reference for
 * @return a non-null array of bytes holding the reference bases in referenceReader
 */
@Ensures("result != null")
public byte[] getReference( final IndexedFastaSequenceFile referenceReader, final int padding, final GenomeLoc genomeLoc ) {
    if ( referenceReader == null ) throw new IllegalArgumentException("referenceReader cannot be null");
    if ( padding < 0 ) throw new IllegalArgumentException("padding must be a positive integer but got " + padding);
    if ( genomeLoc == null ) throw new IllegalArgumentException("genomeLoc cannot be null");
    if ( genomeLoc.size() == 0 ) throw new IllegalArgumentException("GenomeLoc must have size > 0 but got " + genomeLoc);

    final byte[] reference =  referenceReader.getSubsequenceAt( genomeLoc.getContig(),
            Math.max(1, genomeLoc.getStart() - padding),
            Math.min(referenceReader.getSequenceDictionary().getSequence(genomeLoc.getContig()).getSequenceLength(), genomeLoc.getStop() + padding) ).getBases();

    return reference;
}

Unit testing

All classes and methods in the GATK should have unit tests to ensure that they work properly, and to protect yourself and others who may want to extend, modify, enhance, or optimize you code. That GATK development team assumes that anything that isn't unit tested is broken. Perhaps right now they aren't broken, but with a team of 10 people they will become broken soon if you don't ensure they are correct going forward with unit tests.

Walkers are a particularly complex issue. UnitTesting the map and reduce results is very hard, and in my view largely unnecessary. That said, you should write your walkers and supporting classes in such a way that all of the complex data processing functions are separated from the map and reduce functions, and those should be unit tested properly.

Code coverage tells you how much of your class, at the statement or function level, has unit testing coverage. The GATK development standard is to reach something >80% method coverage (and ideally >80% statement coverage). The target is flexible as some methods are trivial (they just call into another method) so perhaps don't need coverage. At the statement level, you get deducted from 100% for branches that check for things that perhaps you don't care about, such as illegal arguments, so reaching 100% statement level coverage is unrealistic for most clases.

You can find out more information about generating code coverage results at Analyzing coverage with clover

We've created a unit testing example template in the GATK codebase that provides examples of creating core GATK data structures from scratch for unit testing. The code is in class ExampleToCopyUnitTest and can be viewed here in github directly ExampleToCopyUnitTest.

The GSA-Workflow

As of GATK 2.5, we are moving to a full code review process, which has the following benefits:

  • Reducing obvious coding bugs seen by other eyes
  • Reducing code duplication, as reviewers will be able to see duplicated code within the commit and potentially across the codebase
  • Ensure that coding quality standards are met (style and unit testing)
  • Setting a higher code quality standard for the master GATK unstable branch
  • Providing detailed coding feedback to newer developers, so they can improve their skills over time

The GSA workflow in words :

  • Create a new branch to start any work. Never work on master.
    • branch names have to follow the convention of [author prefix][feature name][JIRA ticket] (e.g. rp_pairhmm_GSA-232)
  • Make frequent commits.
  • Push frequently your branch to origin (branch -> branch)
  • When you're done -- rewrite your commit history to tell a compelling story Git Tools Rewriting History
  • Push your rewritten history, and request a code review.
    • The entire GSA team will review your code
    • Mark DePristo assigns the reviewer responsible for making the judgment based on all reviews and merge your code into master.
  • If your pull-request gets rejected, follow the comments from the team to fix it and repeat the workflow until you're ready to submit a new pull request.
  • If your pull-request is accepted, the reviewer will merge and remove your remote branch.

Example GSA workflow in the command line:

# starting a new feature
git checkout -b rp_pairhmm_GSA-332
git commit -av 
git push -u origin rp_pairhmm_GSA-332

# doing work on existing feature
git commit -av
git push

# ready to submit pull-request
git fetch origin
git rebase -i origin/master
git push -f

# after being accepted, delete your branch
git checkout master 
git pull
git branch -d rp_pairhmm_GSA-332
(the reviewer will remove your github branch)

Commit histories and rebasing

You must commit your code in small commit blocks with commit messages that follow the git best practices, which require the first line of the commit to summarize the purpose of the commit, followed by -- lines that describe the changes in more detail. For example, here's a recent commit that meets this criteria that added unit tests to the GenomeLocParser:

Refactoring and unit testing GenomeLocParser

-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit

Now, git encourages you to commit code often, and develop your code in whatever order or what is best for you. So it's common to end up with 20 commits, all with strange, brief commit messages, that you want to push into the master branch. It is not acceptable to push such changes. You need to use the git command rebase to reorganize your commit history so satisfy the small number of clear commits with clear messages.

Here is a recommended git workflow using rebase:

  1. Start every project by creating a new branch for it. From your master branch, type the following command (replacing "myBranch" with an appropriate name for the new branch):

    git checkout -b myBranch
    

    Note that you only include the -b when you're first creating the branch. After a branch is already created, you can switch to it by typing the checkout command without the -b: "git checkout myBranch"

    Also note that since you're always starting a new branch from master, you should keep your master branch up-to-date by occasionally doing a "git pull" while your master branch is checked out. You shouldn't do any actual work on your master branch, however.

  2. When you want to update your branch with the latest commits from the central repo, type this while your branch is checked out:

    git fetch && git rebase origin/master
    

    If there are conflicts while updating your branch, git will tell you what additional commands to use.

    If you need to combine or reorder your commits, add "-i" to the above command, like so:

    git fetch && git rebase -i origin/master
    

    If you want to edit your commits without also retrieving any new commits, omit the "git fetch" from the above command.

If you find the above commands cumbersome or hard to remember, create aliases for them using the following commands:

    git config --global alias.up '!git fetch && git rebase origin/master'
    git config --global alias.edit '!git fetch && git rebase -i origin/master'
    git config --global alias.done '!git push origin HEAD:master'

Then you can type "git up" to update your branch, "git edit" to combine/reorder commits, and "git done" to push your branch.

Here are more useful tutorials on how to use rebase:

If you need help with rebasing, talk to Mauricio or David and they will help you out.

Sorry, there are no publicly available documents of this type with the tag #coding-standards. Try one of the other types.
Sorry, there are no publicly available documents of this type with the tag #coding-standards. Try one of the other types.