News / 01.9.18

Broad Institute releases open-source GATK4 software for genome analysis, optimized for speed and scalability

Credit: Lauren Solomon, Broad Communications
By Broad Communications

New version of the leading genome analysis toolkit increases analysis scope and includes enhanced machine learning algorithms for greater performance.

Today the Broad Institute of MIT and Harvard is releasing version 4.0 of the Genome Analysis Toolkit (GATK), the institute's flagship genome variant discovery package for analysis of high-throughput sequencing data. GATK4 is fully open-source, is available at no cost for academic and commercial research on local computing infrastructure, and is also designed for deployment in cloud environments. The reengineered GATK addresses key bottlenecks, including one analysis step in which GATK4 can process certain types of genomic sequence data 15 times faster than GATK3 while increasing input capacity by a factor of five.

The toolkit is developed by a team of software engineers and data scientists at the Broad Institute’s Data Sciences Platform (DSP), who work directly with genomics researchers to ensure the GATK stays ahead of increasing demands for accuracy, speed, and complexity.

Broad Institute data scientists, GATK software engineers, and guest panelists will host a live-streamed launch event at 2pm EST today to discuss the expanded features, performance improvements, and cloud deployment of GATK4. The panel will include leaders from the University of California-Santa Cruz, the Harvard T.H. Chan School of Public Health, Yale School of Medicine, Intel, IBM Watson, Verily Life Sciences, Amazon Web Services, Cloudera, Alibaba Cloud, and Microsoft Genomics.

On January 9th, 2018, the Broad Institute Data Sciences Platform unleashed the latest version of GATK. GATK4 is completely open source.

Wider scope of analysis, machine learning, and a performance boost

GATK4 offers significant research advantages over earlier versions, which focused on germline short variant discovery only. GATK4 is the first and only open-source software package that covers all major variant classes (SNPs, indels, copy number, and structural variation) for both germline and cancer, and for genomes and targeted sequencing assays. In addition, GATK4 includes tools that take advantage of machine learning, including neural network algorithms, which improve accuracy in variant discovery.

Over the last 12 months software engineers at the Intel-Broad Center for Genomic Data Engineering have also incorporated major performance optimizations into GATK4.

“We completely reengineered GATK4 to optimize speed, scale, and flexibility, while maintaining the best practices pipelines and high quality of data output that have become the standard for genomics research around the world,” said Eric Banks, Senior Director of the Data Sciences Platform at the Broad Institute. “Already, GATK4 has been put to the test internally at the Broad Institute for six months — including by processing the 24 terabytes of sequence data our genomics platform produces and moves to the cloud every day.

“This major new version benefits from the combined experience of our team’s scientific and operational expertise running genomic pipelines at scale, as well as the engineering excellence of computational industry leaders,” Banks said. “Now, we are proud to freely share this toolkit with researchers everywhere.”

“Intel collaborated with the Broad Institute to completely rewrite GATK4’s core code for performance, flexibility, speed and scalability, with end-to-end pipeline scripts that can be run on any local or cloud compute infrastructure,” said Kay Eron, general manager of Analytics Industry Solutions at Intel Corporation. “Additionally, these improvements allow users to call new variant types for both germline and somatic analyses.”

Expanded and improved GATK4 features include:

  • somatic short variant calling with Mutect2, which combines a proven somatic modeling algorithm (the widely used single nucleotide variant caller Mutect) with the haplotype-centric logic of the GATK's leading germline variant caller, HaplotypeCaller.
  • full discovery pipeline capabilities for somatic copy number variants (GATK CNV) using both coverage and allelic balance estimation. These pipelines are engineered to scale seamlessly from gene panels and exomes to whole genome sequencing (WGS).
  • advance versions of tools currently in development for structural variation discovery; germline CNV discovery using machine learning approaches; a functional annotation framework for both germline and somatic analyses; and a new pipeline for germline short variant filtering based on convolutional neural networks. (Learn more about deep learning in GATK4 in a December 2017 blog post.)

Powered by collaboration

Reflecting the collaborative nature of large-scale genomic research today, engineers from Intel, Google, Cloudera, Microsoft Genomics, IBM, Amazon, and Alibaba all made significant contributions to the development and cloud deployment of GATK4.

For instance, as described in a January 2018 video, Intel's development of the GenomicsDB datastore dramatically improved the scalability of GATK's GVCF-based germline joint-calling pipeline, allowing researchers to run full variant calling analyses of larger data sets significantly faster than with previous versions. (Learn more about how Intel contributed to the development of GATK4 in an Intel blog post.)

Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark both for traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services, such as Google Dataproc. “It has been a privilege collaborating with the Broad Institute over the last two years to ensure that GATK4 can use the power of Apache Spark to make genomics workflows more scalable than previous approaches,” said Tom White, principal data scientist at Cloudera.

And in complementary work, engineers from Verily Life Sciences gave GATK4 the ability to stream data directly from Google Cloud Storage (GCS) through the NIO protocol, enabling considerable savings — including of time and financial resources — for cloud executions.

Users can run GATK4 locally or on the cloud

As announced in May 2017, the Broad Institute is making GATK4 available under a fully open-source BSD 3-clause license. Users can download the software on the GATK website or can run the tools via a cloud provider.

Researchers of all backgrounds, including those without computational training, can access GATK4’s cloud-based pipelines through FireCloud, the Broad Institute’s cloud-based analysis portal. These pipelines are fully configured and ready to run on preloaded example datasets.

All users can access GATK4 through FireCloud free of charge, although cloud providers may charge their own fees for data storage and processing. Through a partnership with Google Cloud Platform, Broad is offering a $250 credit per user toward compute and storage costs for the first 1,000 applicants; interested users can learn more on the GATK website.

Live-streamed launch event on January 9, 2-4pm EST

The GATK development team and guest panelists will present key new features and highlights of GATK4 in a Facebook Live event held at the Broad Institute in Cambridge, Massachusetts, and live-streamed on January 9th, 2018 from 2pm to 4pm EST.

Participants can ask questions and receive answers in real-time from the GATK team. See the GATK blog to learn more about the agenda and speaker lineup.