Broad moves genome analysis to the cloud; collaborates with cloud providers to offer access to the leading genome analysis toolkit

At the Broad Institute of MIT and Harvard we generate a lot of data. We also develop the cutting-edge software tools researchers need to find signals in the noise. We are committed to sharing tools and data openly with the entire scientific community, and have dedicated a team to constantly improve our Genome Analysis Toolkit (GATK), which is the software package we developed for analysis of high-throughput sequencing data.

Improving the GATK is no longer just about enhancing variant discovery and genotyping—it’s about removing the technical barriers that can stand in the way of tackling ambitious projects at scale. The amount of genomic data on earth doubles about every eight months, including here at the Broad, where our sequencers generate 14 gigabytes of genomic data every minute—about 20 terabytes per day. Meanwhile, the GATK suite of tools is becoming more and more popular with researchers worldwide, with 31,000 registered users using the software on laptops, local server farms, and computing clusters at academic and commercial institutions.
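As a rough check of the figures above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes decimal units (1 TB = 1,000 GB), a constant sequencing rate around the clock, and a fixed eight-month doubling time. The function names are ours, chosen for illustration.

```python
# Back-of-the-envelope check of the data-volume figures quoted above.
# Assumptions (not stated in the post): decimal units (1 TB = 1,000 GB),
# sequencers running at a constant rate, and a fixed doubling time.

GB_PER_MINUTE = 14          # figure from the post
MINUTES_PER_DAY = 60 * 24
DOUBLING_MONTHS = 8         # figure from the post


def daily_output_tb(gb_per_minute: float = GB_PER_MINUTE) -> float:
    """Daily output in terabytes for a constant per-minute rate in gigabytes."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth over `months` given a fixed doubling time."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    print(f"Daily output: {daily_output_tb():.2f} TB")
    print(f"Growth over 12 months: {growth_factor(12):.2f}x")
    print(f"Growth over 24 months: {growth_factor(24):.0f}x")
```

Under these assumptions, 14 gigabytes per minute works out to 20.16 terabytes per day, and an eight-month doubling time implies roughly 2.8-fold growth per year and 8-fold growth over two years.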

We want to enable these researchers to think big about their projects and think less about their local computing system’s ability to handle them. We needed a solution that would work for us at the Broad and that we could also offer as an option to researchers everywhere.

We have spent the last year engineering, testing, and launching a method to analyze large amounts of genomic data with GATK that takes advantage of the scale and speed of cloud computing environments. Working together with Google, we developed a cloud-based GATK that Broad researchers are now using to generate genomic insights. This meant creating an entirely new system for developing and deploying pipelines as well as a new framework for wet lab quality control that uncouples data generation from data processing. It has worked so well that we have completely ported the largest and most important of our production pipelines, the Whole Genome Sequencing Pipeline, to the Google Cloud Platform (GCP). We are beginning to run production jobs on GCP and will be switching over entirely this month. (For more, read our post about this on the Google Research Blog.)

The next step is to make this option available to GATK users outside the Broad—and to open it up to other cloud providers as well.

Today we are proud to announce that we are collaborating with Amazon Web Services (AWS), Cloudera, Google, IBM, Intel, and Microsoft, with plans to enable cloud-based access to GATK. Through these collaborations we plan to make the GATK Best Practices pipeline available to users through a software-as-a-service (SaaS) mechanism, expanding access beyond traditional ‘desktop’ solutions and giving users an additional way to run GATK.

We expect to launch more partnerships in the future as other trusted providers join the collaboration. Across all platforms, whether cloud-based or locally run, Broad will continue to update, enhance, and troubleshoot the full GATK to ensure it’s always the best choice for generating meaningful genomic insights.

As excited as we are about these new cloud-based options for genomic research, the challenges of executing large genomic workflows and issues around storage and scalability will continue to exist. This is why Broad engineers are working with Intel to speed variant detection and biomarker discovery and enable discoveries that could not have been detected with smaller cohorts.

These collaborations will also help us drive the development of GATK4, the next generation of GATK. GATK4 will utilize the Spark distributed computing framework to facilitate parallelism and in-memory computations, thus speeding up the methods. GATK4 will also extend the range of use cases supported by GATK to include cancer, structural variation, copy number variation, and more.

Our experience performing cloud-based genome analysis also prompted us to write a home-grown programming language, which we call Workflow Description Language, or WDL. WDL lets us manage the movement of analytical pipelines to the cloud, so whenever we update GATK, users of partner systems can always be on the latest version. Starting in late 2016, users of some Illumina systems will have access to the latest version of GATK through BaseSpace, which is Illumina’s cloud-based platform for next-generation sequencing data management and analysis.

This is all part of our effort to build and deploy best-in-class, scalable, and widely accessible tools that can help researchers gain meaningful insights from genomic data. We believe that by offering GATK as a direct download and by developing a software-as-a-service cloud-based option we can encourage discovery while breaking down some of the infrastructure barriers that can stand in the way of innovation.