Toward more reproducible sequence processing, and the $5 pipeline
As the already huge volume of data generated by sequencing centers continues to grow, researchers and data scientists have to adapt their approaches to data processing so that accuracy stays high and costs stay low. In a recent pair of posts on the GATK blog, Eric Banks, senior director of the Broad's Data Sciences Platform, outlined how several sequencing centers, including the Broad, have come together to make their analyses more reproducible. What's more, he noted, the Broad has made major strides over the last two years in making sequence analysis cheaper.
First, the accuracy question. Sequencing centers and laboratories all want the same thing out of their genomic data: an accurate list of genetic variants from which to generate new biological insights. But the precise steps they take to achieve that goal (their analysis pipelines) can differ in subtle ways from center to center and lab to lab. Those pipeline variations arise from a host of considerations, such as local computing infrastructure and the desired balance between cost and runtime, and can sometimes generate slightly different results. Why is that important? Banks provided a hypothetical:
Imagine you want to find a causal variant in a sample you really care about, so you run variant calling and then compare the resulting callset against gnomAD in order to find the population-based allele frequencies. And, behold, you find a SNP that’s not in gnomAD! This could be the rare variant that you’ve been searching for… or it can be an artifact that arose because you didn’t process your sample with a pipeline that’s functionally equivalent to the pipeline used to make gnomAD.
Ideally, any given pipeline should produce equivalent results from the same starting set of sequence data. To get there, Banks wrote, a consortium of centers, including the Broad, has developed, agreed upon, and launched a "functional equivalence" specification for sequencing analysis pipelines:
[F]or the past year, we worked closely with several of the other large genome sequencing and analysis centers (New York Genome Center, Washington University, University of Michigan, Baylor College of Medicine) to develop a standardized pipeline specification that would promote compatibility among our respective institutions' pipelines. And I'm proud to say we accomplished our goal! It took a lot of testing and evaluations, but the consortium was able to define very precisely what are the components of a pipeline implementation from unmapped reads to an analysis-ready CRAM file that will make it “functionally equivalent” to any other implementation that adheres to this standard specification. This means that any data produced through such functionally equivalent pipelines will be directly comparable without risk of batch effects.
Apart from accuracy, another major factor in genome sequence analysis is cost: not the cost of actually reading As, Cs, Gs, and Ts, but the cost of the computing time and power (i.e., the compute cost) needed to stitch those sequences into order and highlight places where a given genome varies from a standard reference genome, which is the first step in comparing individuals' genomes.
Two years ago, the GATK team began moving the toolkit's Best Practices pipelines to the cloud and optimizing them for it, so as to leverage the flexibility of cloud computing for faster, cheaper sequence analysis. As Banks noted in his second GATK blog post, those efforts have now borne fruit in the form of a nine-fold reduction in compute cost, from $45 per genome in 2016 to $5 today:
Let me give you a real-world example of what this means for an actual project. In February 2017, our production team processed a cohort of about 900 30x [whole genome sequencing] samples through our Best Practices germline variant discovery pipeline; the compute costs totalled $12,150 or $13.50 per sample. If we had run the version of this pipeline we had just one year prior (before the main optimizations were made), it would have cost $45 per sample; a whopping $40,500 total! … [I]f we were to run this same pipeline today, the cohort would cost only $4,500 to analyze.
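The arithmetic behind those figures is easy to check. The short Python sketch below is only an illustration: it uses the per-sample costs quoted in the post, treats "about 900" samples as exactly 900, and the function and variable names are made up for this example rather than taken from GATK or FireCloud.

```python
# Per-sample compute costs in USD, as quoted in Banks's post.
COST_PER_SAMPLE = {
    "2016 pipeline": 45.00,
    "February 2017 pipeline": 13.50,
    "current pipeline": 5.00,
}

# The post says "about 900" samples; assumed to be exactly 900 here.
N_SAMPLES = 900


def cohort_cost(per_sample_usd: float, n_samples: int) -> float:
    """Total compute cost for a cohort at a flat per-sample price."""
    return per_sample_usd * n_samples


def fold_reduction(before_usd: float, after_usd: float) -> float:
    """How many times cheaper the newer price is than the older one."""
    return before_usd / after_usd


if __name__ == "__main__":
    for label, per_sample in COST_PER_SAMPLE.items():
        total = cohort_cost(per_sample, N_SAMPLES)
        print(f"{label}: ${per_sample:.2f} per sample, ${total:,.0f} per cohort")

    reduction = fold_reduction(
        COST_PER_SAMPLE["2016 pipeline"], COST_PER_SAMPLE["current pipeline"]
    )
    print(f"Reduction from 2016 to today: {reduction:.0f}-fold")
```

Under those assumptions the cohort totals come out to $40,500, $12,150, and $4,500, matching the numbers in the quote, and the drop from $45 to $5 per genome works out to the nine-fold reduction mentioned above.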
To learn more about the functional equivalence specification, or to find out how researchers and institutions can make use of the $5 compute pipeline via FireCloud, be sure to read the whole of Banks's posts.