GATK development process

From GSA

Introduction and motivation

Before June 2011, the GATK was developed using Subversion (SVN), a standard centralized version control system. SVN was sufficient for our group's development needs in the past, but the GATK team is now larger and more diverse, and comprises several independent developer teams: a methods development group, a technology development and analysis group, and a production data processing group. Each team has specific development goals that come into conflict when all code is shared through a single centralized repository:

  • The methods development team needs to share and modify code reflecting the latest, best tools for NGS analysis.
  • The analysis team has many experimental, short-lived, and not-intended-for-the-public tools for specific analyses.
  • The production team wants to run reliable, stable tools, and to keep supporting and improving those versions of the tools long after the development team has moved on.

To better satisfy these conflicting goals, the GATK team decided to migrate from Subversion to Git, a distributed version control system (DVCS) used by many prominent open-source projects, including the Linux kernel, GCC, and Google's Android. A DVCS has several important advantages over a centralized system like SVN:

  • Distributed development model: each developer has their own independent set of repositories complete with full revision histories, and can commit, revert, branch, merge, and perform other version control activities locally without being forced to share their changes until they're ready to do so.
  • Branching and merging become an unavoidable fact of life in a distributed development environment, and so these operations are designed to be both easier to perform and more robust than their counterparts in centralized systems like SVN.
  • The options for sharing changes between individual developers and groups of developers are much richer. Changes no longer need to be published to a single centralized location in order to be easily shared: developers can set up their own private networks of repositories to collaborate on subprojects before pushing work into the group-wide repositories.

Taken together, these DVCS features enable each team within our group to pursue its goals independently with less opportunity for conflict:

  • Methods development can work on cutting-edge tools in a repository that is separate from the repository used by production to build stable tools with well-defined performance characteristics. The new features introduced by methods development can be periodically merged into the production repository once they've stabilized.
  • Private, not-intended-for-the-public tools like those created by the analysis team can propagate freely within our internal repositories while being held back from our publicly accessible repository.
  • Bug fixes can easily be propagated to all of our repositories at once.

Overview of the new GATK development model

Private Repositories

The new GATK development process is organized at the highest level around two private, Broad-only repositories:

  • Unstable: All development of new features takes place in this repository.
  • Stable: Contains the latest stable release of the GATK. When a bug fix to the latest release is required, it is pushed into both Stable and Unstable simultaneously.

Periodically, a new GATK release will be planned and its key features identified. The tools and capabilities in Unstable will then be brought up to a high quality standard, covering individual tools, the interactions between tools, and documentation. Once these milestones are reached, Unstable will be merged into Stable, and the code in Stable will be tagged with a version number.
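
The flow between the two repositories can be sketched as a toy model. The Python below is only an illustration under stated assumptions: it does not call Git, it represents each repository as an ordered list of change IDs, and every name in it (Repo, develop_feature, apply_bugfix, cut_release, the change IDs, and the "v-next" tag) is hypothetical rather than part of the GATK tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Repo:
    """Toy stand-in for a repository: an ordered list of change IDs."""
    name: str
    changes: List[str] = field(default_factory=list)
    # Maps a tag name to the number of changes present when it was tagged.
    tags: Dict[str, int] = field(default_factory=dict)

    def add(self, change: str) -> None:
        # Record a change once; re-applying the same change is a no-op.
        if change not in self.changes:
            self.changes.append(change)


def develop_feature(unstable: Repo, change: str) -> None:
    """New features land only in Unstable."""
    unstable.add(change)


def apply_bugfix(stable: Repo, unstable: Repo, change: str) -> None:
    """A fix to the latest release goes into both repositories."""
    stable.add(change)
    unstable.add(change)


def cut_release(stable: Repo, unstable: Repo, version: str) -> None:
    """Bring everything in Unstable into Stable, then tag Stable."""
    for change in unstable.changes:
        stable.add(change)
    stable.tags[version] = len(stable.changes)


# Hypothetical history: one feature, one bug fix, another feature, a release.
stable, unstable = Repo("Stable"), Repo("Unstable")
develop_feature(unstable, "feature-A")
apply_bugfix(stable, unstable, "fix-1")
develop_feature(unstable, "feature-B")
before_release = list(stable.changes)  # Stable has only the bug fix so far
cut_release(stable, unstable, "v-next")
```

In this model, Stable sees only bug fixes between releases, and after a release the two repositories hold the same set of changes. Real Git merges track full history rather than a flat list, so this shows only the direction in which changes move.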

At a lower level, individual developers within GSA each have their own independent child repositories cloned from Unstable and Stable, and can at their discretion create additional shared repositories for collaborative work outside of the group-wide repositories.

Public Repository

Our publicly accessible Release repository is hosted on GitHub. It automatically mirrors our internal Stable repository, but omits any private tools in Stable that are not intended for public consumption. We also maintain a continuously updated binary release built from the latest contents of our GitHub repository.
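
One way to picture a mirror that leaves out private tools is as a filter over the paths in Stable. The sketch below is an assumption-laden illustration, not a description of how the actual mirroring is implemented: the directory prefixes, the file names, and the public_view function are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical path prefixes marking content that should stay internal.
PRIVATE_PREFIXES: Tuple[str, ...] = ("private/", "playground/")


def public_view(
    stable_paths: Iterable[str],
    private_prefixes: Tuple[str, ...] = PRIVATE_PREFIXES,
) -> List[str]:
    """Return the paths that would appear in the public mirror."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return [p for p in stable_paths if not p.startswith(private_prefixes)]


# Hypothetical contents of the Stable repository.
stable_paths = [
    "public/walkers/ExampleWalker.java",
    "private/walkers/ExperimentalWalker.java",
    "playground/scratch/Notes.txt",
    "build.xml",
]
release_paths = public_view(stable_paths)
```

Here release_paths keeps the public walker and the build file and drops the two internal entries, preserving the original order.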

Instructions for downloading the latest binary and/or source releases can be found here

What users of the public binary and source releases need to know

Below are the major changes that users of the public binary/source releases can expect to encounter as a result of our move to a new development process:

  • The latest public GATK binary release will be supported: whenever a bug fix is pushed into our release repository, the GATK jar file will be automatically updated for users. GATK users will therefore no longer need to build from source to get bug fixes, which should significantly reduce their exposure to the GATK development process.
  • The public release (both the source release and the binary release) will lack certain tools and directories that were formerly made public but were never intended for public consumption. This is in part to avoid unnecessarily exposing users to in-progress, experimental tools, and to avoid generating support requests for tools that we don't officially support.
  • If you want to download and build the GATK source, you will need to use Git rather than SVN. The latest version of Git can be downloaded here.

What GATK developers need to know

GATK developers should consult the following guide to using git for development within our new repository structure:
