Broad Institute teams up with AWS, Cloudera, Google, IBM, Intel, and Microsoft to enable cloud-based access to the Genome Analysis Toolkit, simplifying genomic research
By Broad Communications
April 5, 2016
Broad Institute of MIT and Harvard is collaborating with Amazon Web Services (AWS), Cloudera, Google, IBM, Intel, and Microsoft to enable cloud-based access to its Genome Analysis Toolkit (GATK) software package. Through these collaborations the GATK Best Practices pipeline will be available to users of cloud service providers through a software-as-a-service (SaaS) mechanism, expanding access beyond traditional desktop solutions. Broad will also work with collaborators to drive the creation of the next generation of GATK based on the Apache Spark™ computing framework.
“By providing a cloud-hosted solution, we can greatly expand access and facilitate usage of these genome analysis tools,” said Eric Banks, senior director of Data Sciences and Data Engineering at Broad and a creator of the GATK software package. “There are currently more than 31,000 registered users of the Broad Institute’s GATK. The vast majority set up an extensive local compute and storage infrastructure to process the huge amount of information required to conduct genomic analyses. These collaborations will provide new options that can remove traditional barriers of scale while offering the same high level of data quality.”
This effort builds on work that began with the June 2015 alpha offering of GATK on Google Cloud Platform and extends it to additional cloud providers. (For an update on that project, see the April 5 Google Research Blog post.)
“Since the alpha launch of Broad Institute’s GATK on Google Genomics last summer, there has been a tremendous amount of interest. We have run many thousands of samples through this pipeline for a variety of users. We’ve also optimized the pipeline to make it remarkably cost effective,” said David Glazer, director of Google Genomics. “Working with Broad Institute to build and launch this pipeline has provided a powerful demonstration of Google Cloud Platform’s ability to accelerate life science.”
“It is a pleasure to be working with Broad to offer GATK on Microsoft Azure,” said David Heckerman of Microsoft Genomics. “This will greatly facilitate research and clinical genomic analyses.”
“As genomic data increasingly plays a role in research and treatment, cloud-based access to powerful analytic tools like GATK will be critical to accelerate precision medicine,” said Steve Harvey, vice president of Watson Health and head of Watson for Genomics. “We are eager to support data-driven insights for clinicians and researchers through the Watson Health Cloud.”
Users should be able to access cloud-based GATK options beginning later this year. Pricing will vary depending on the provider. The GATK will continue to be available for existing and new users to download and deploy on their local infrastructure. Broad Institute provides it free for academic research and for a licensing fee to commercial users.
Beyond the cloud to GATK4
These collaborations will also help the Broad Institute drive the development of GATK4, the next generation of GATK. GATK4 will utilize the Spark distributed computing framework to facilitate parallelism and in-memory computations, thus speeding up the methods. GATK4 will also extend the range of use cases supported by GATK to include cancer, structural variation, copy number variation, and more.
Already, Cloudera, Intel, and Google have contributed to the development of GATK4. “Cloudera’s early commitment to Spark drove us to be the first Hadoop vendor to ship, support, and offer Spark training in 2014. We are honored to apply our expertise to the downstream multi-omic analysis space, investing in Spark as a bioinformatics standard, and working with Broad to create the next generation of GATK,” said Shawn Dolley, industry leader of Life Sciences at Cloudera.
“Optimizing GATK for cloud-based access will accelerate the utilization of genomic data to fuel new insights into disease and treatment,” said Eric Dishman, Intel vice president, Health and Life Sciences. “To tackle one of the biggest big data challenges, Intel is also working closely with the Broad Institute in co-developing tools compatible with GATK to eliminate the barriers to more effective and widespread use of large scale genomic workloads.”