Harnessing the flood: Scaling up data science in the big genomics era
“DNA Sequencing Caught in Deluge of Data.” For geneticists and computational biologists, headlines like this — referring to the fact that genomics’ sequencing capacity is outstripping its data processing and analysis tools — sound all too familiar. (This particular one comes from a New York Times story published nearly six years ago).
And rather than letting up, the deluge is growing, such that some have even asked whether “genomical” should be the new “astronomical.”
“Genomic datasets double in size roughly every eight months,” said Eric Banks, senior director of the Broad Data Sciences and Data Engineering (DSDE) group. “It’s a level of growth that for years has far exceeded Moore’s law.
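The gap Banks describes is easy to check with quick arithmetic. Assuming, per his figure, that genomic datasets double every eight months, and taking one common statement of Moore's law (a doubling roughly every 24 months), a six-year span looks like this:

```python
# Compare growth over six years, assuming genomic datasets double every
# 8 months (per the article) and Moore's-law capacity doubles roughly
# every 24 months (one common statement of the law).
months = 6 * 12

genomic_growth = 2 ** (months / 8)   # 9 doublings in 6 years
moore_growth = 2 ** (months / 24)    # 3 doublings in 6 years

print(f"Genomic data: ~{genomic_growth:.0f}x in 6 years")
print(f"Moore's-law capacity: ~{moore_growth:.0f}x in 6 years")
```

Over that span the data grow roughly 512-fold while compute capacity grows roughly 8-fold, which is the mismatch driving the rest of this story.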
“And in some ways, that’s a good thing,” he continued, “because the world is full of really hard, medically relevant scientific problems that we can solve only with petabytes of data.”
Genomics is far from floundering in the flood. If anything, the field is evolving to meet the challenges, which themselves arise from what some have described as biology’s metamorphosis into a “big data” domain.
And indeed, as Jon Bloom — a mathematician and computer scientist in the Broad Institute’s Stanley Center for Psychiatric Research and Massachusetts General Hospital’s Analytic and Translational Genetics Unit — points out, genomics may have entered what the late Microsoft computer scientist Jim Gray called science’s fourth paradigm, “data-intensive science.”
“Instead of starting with a hypothesis and collecting a little bit of data to test it,” he explained, “in this paradigm you measure everything and run computational experiments on those data.”
This shift is fueling a burst of creative effort in computational methods and data engineering. Data scientists at the Broad and elsewhere are taking a hard look at the infrastructure they and their colleagues in other big data sciences and industry use on a daily basis. What they see is causing them to ask, What will it take to scale our tools and approaches to match the deluge of the future, and of the now?
“Fat” data, and lots of it
Before researchers can start exploring comes the nitty-gritty work of data processing, running raw As, Cs, Ts, and Gs coming off sequencing machines through several algorithmic quality control and analysis steps. Strung together in “pipelines,” these steps generate the lists of gene, exome, or genome variants and genotypes that fuel further science.
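Conceptually, a pipeline is just a sequence of steps in which each step's output feeds the next. A minimal sketch, with illustrative stand-in functions rather than real QC or analysis tools:

```python
# Pipeline sketch: each step's output feeds the next. The step names are
# hypothetical stand-ins for real quality-control and analysis tools.
def quality_filter(reads):
    """Drop reads containing ambiguous 'N' base calls."""
    return [r for r in reads if "N" not in r]

def count_bases(reads):
    """Tally A/C/G/T counts across the surviving reads."""
    joined = "".join(reads)
    return {base: joined.count(base) for base in "ACGT"}

pipeline = [quality_filter, count_bases]

data = ["ACGT", "ANGT", "CCGT"]
for step in pipeline:
    data = step(data)   # run the steps in order, threading the data through

print(data)
```

Real production pipelines chain far heavier tools (aligners, variant callers) in the same fashion, which is why automation and robust engineering matter at scale.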
Geraldine Van der Auwera of DSDE’s Genome Analysis Toolkit (GATK) team highlights three overarching trends that stress the hardware and software of data production:
- Scientists incorporate sequencing in more studies, and collect orders of magnitude more samples in those studies, than in years past. The sheer volume puts a premium on pipeline automation and robust engineering.
- Researchers are squeezing more data from individual samples (e.g., whole exomes, whole genomes, whole transcriptomes) than ever before. A typical whole genome BAM file (a standard sequence data type) exceeds 100 gigabytes. “The data have gotten fat,” Van der Auwera chuckled.
- Sequencing is rapidly becoming a clinical commodity, placing new emphasis on speed. “In a research environment, it’s not such a big deal if it takes a few days to run a pipeline,” Van der Auwera said. “If you want to diagnose a patient, delays like that aren’t okay.”
One way of channeling the flood is for developers to make use of parallelization, engineering pipelines that can tap many computer processing cores simultaneously. “Some steps even get parallelized within pipelines,” Van der Auwera explained. “It distributes memory and processing demands such that the whole process runs more efficiently.”
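The within-pipeline parallelization Van der Auwera describes is commonly done by "scattering" a step across genomic intervals and "gathering" the per-interval results. A minimal sketch of that pattern, with a hypothetical `call_variants_in_interval` function standing in for a real pipeline tool:

```python
# Scatter-gather sketch: run one pipeline step over many genomic
# intervals in parallel across processes, then gather the results.
from concurrent.futures import ProcessPoolExecutor

def call_variants_in_interval(interval):
    """Hypothetical stand-in for a per-interval analysis step."""
    chrom, start, end = interval
    return f"{chrom}:{start}-{end} processed"

intervals = [
    ("chr1", 1, 50_000_000),
    ("chr1", 50_000_001, 100_000_000),
    ("chr2", 1, 50_000_000),
]

# Scatter the intervals across worker processes, gather the results.
with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call_variants_in_interval, intervals))

print(results)
```

Because each interval is independent, memory and processing demands spread across cores, which is what lets the whole step finish faster than a single sequential pass.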
Stepping that solution up a notch, moving pipelines from “on premises” computer clusters to external cloud computing services opens up an attractive combination of flexibility and scalability (a.k.a. elasticity). It’s a mix that helps address frequently fluctuating processing needs and keeps institutions from having to buy and maintain more servers than they routinely need just to cover spikes in demand.
“Processing a single genome in a reasonable amount of time takes at least a small server with a few cores,” Van der Auwera said. “For hundreds or thousands of samples, you need so much memory and so many cores.
“With the cloud,” she continued, “you can request a large number of cores for an hour, and dial back to just a few cores for the next few hours. And you can do it with little advance warning.”
Boldly going where no data have gone before
Variant lists and genotypes themselves become fodder for data exploration — the hunt for trait or disease associations, new questions, and/or new treatment opportunities.
“While it's critical to scale tried-and-true mathematical models to bigger data, we also need to do exploratory data science at scale to find computational footholds in research questions for which no best practice models exist,” said Bloom, one of the founders of the Hail project, an open-source effort to build a scalable framework for exploring and analyzing genetic data. “By exploring data together, computational and biomedical experts can find those footholds so that computers can help us all learn from the data.”
The main bottleneck in data exploration, he continued, is what he described as latency, “the time it takes to go from formulating a computational experiment to implementing it as code, running that code on existing data, and getting back results that can suggest the next computational experiment or data to collect.”
With genomic data growing far faster than CPU performance, computational experiments’ latency has exploded. “We’re well past talking about the difference between a computation taking a millisecond or a second,” Bloom said. “In many cases, it’s between a computation taking a second, a day, or never completing at all.”
Many innovations from industry and other big data fields like physics and astronomy have, until recently, largely passed biology by. But that itself creates an opportunity, according to Bloom: it means that many of the big-data computing problems genomics now faces have already been solved, especially with respect to distributed computing.
“The idea is to understand the needs in genomics and biology, evaluate the open source tools out there, and see which might make the most sense to apply to our questions and data,” Bloom said. “And then adopt, modify, or build anew as needed.”
Take, for instance, one common efficiency-promoting feature baked into other fields’ computational tools: bringing the computation to the data, not the other way around.
“The challenge is to write code and algorithms that minimize how much you have to shuffle data between computers, because communication is much slower between computers than within,” Bloom said. “Moving to a model where data sit in the cloud alongside the computing tools that run against them could bring centers and consortia significant benefits.”
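The shuffle-minimizing approach Bloom describes typically takes the form of partition-local aggregation: each machine reduces its own shard of data to a small summary, and only those summaries cross the network. A single-machine sketch of the idea, with made-up per-sample variant counts:

```python
# Partition-local aggregation: summarize each partition where it lives,
# then combine only the small summaries. This is the pattern distributed
# frameworks use to avoid shuffling raw data between machines.

# Made-up partitions of (sample, variant_count) records, each standing
# in for a data shard resident on one machine.
partitions = [
    [("s1", 4_100_000), ("s2", 4_300_000)],
    [("s3", 4_200_000), ("s4", 4_050_000)],
]

def local_summary(partition):
    """Reduce one partition to a tiny (sample_count, total) summary."""
    counts = [c for _, c in partition]
    return len(counts), sum(counts)

# Only these small tuples would cross the network -- never the raw records.
summaries = [local_summary(p) for p in partitions]
n = sum(count for count, _ in summaries)
total = sum(subtotal for _, subtotal in summaries)
print(f"mean variants per sample: {total / n:,.0f}")
```

The payoff is that communication volume scales with the number of partitions, not with the size of the raw data, which is exactly the property that matters when the data are petabytes.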
As with data production, the final ingredient for efficient exploration is scalability that balances resource use and speed with cost. “Whether a computation takes 10 hours using one core, one hour using 10 cores, or one minute using 600 cores, it should cost as close to the same as possible,” Bloom said. “We should give researchers a knob to dial up more computing resources and get their results back instantly with a negligible increase in cost.”
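Bloom's three scenarios come out cost-equivalent because each consumes the same amount of compute, measured in core-hours; under typical per-core-hour cloud pricing, only the wall-clock time changes. A quick check:

```python
# Each of Bloom's scenarios consumes the same compute: cores x hours.
scenarios = [
    (1, 10),        # 1 core for 10 hours
    (10, 1),        # 10 cores for 1 hour
    (600, 1 / 60),  # 600 cores for 1 minute
]

for cores, hours in scenarios:
    print(f"{cores:>3} cores x {hours:g} h = {cores * hours:g} core-hours")
# All three come to 10 core-hours, so at a fixed price per core-hour the
# cost is the same -- only the time-to-answer differs.
```

That equivalence is what makes the "knob" Bloom describes plausible: scaling out buys speed without (in principle) buying a bigger bill.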
In an ideal world, genomics researchers and computational biologists would be able to easily access large amounts of data in the cloud, express complex computational models with concise, readable code, run those models on those data where they are, and get back results quickly and cheaply.
In this context, the data deluge actually presents an opportunity, not a challenge.
“More data doesn’t just mean more statistical power for existing models,” Bloom said. “It can also mean a chance to use advances in statistics, machine learning, and engineering to ask fundamental questions that we otherwise couldn’t.”