Behind the scenes of The Cancer Genome Atlas: part 2

Yesterday on the blog, we introduced you to some of the Broad researchers who built tools, teams, and resources to generate and analyze a massive flood of data and analytical code for The Cancer Genome Atlas (TCGA). Today we give you a look at the system they built to manage data analysis for the project: Firehose.

In addition to analyzing data generated here, the Broad serves as a Genome Data Analysis Center, which coordinates data generated by other research centers in the TCGA. The center is led by Gaddy Getz, director of Cancer Genome Computational Analysis, and Broad senior associate member Lynda Chin. Even during the TCGA’s first pilot project on glioblastoma, the Broad team recognized the need for a robust analysis management system to handle the flood of data and algorithms. The phrase “drinking from the firehose” aptly describes the scale of the challenge, so the researchers named their solution “Firehose.”

“There are 20 centers in the TCGA, each with its own role and perspective,” says software engineering manager Mike Noble of the Cancer Program, who oversees Firehose, built during the ovarian cancer project by a team led by primary software developer Douglas Voet. “This leads to a pretty significant challenge for coordination and data standards. So we wrestle with this on a daily basis.”

Firehose addresses several issues, one of which Mike calls “the Babel problem.” On a massive, multi-institute effort like the TCGA, scientists often have trouble coordinating. “Everyone’s not really speaking the same language with respect to data,” he says. Firehose “versions” the analytical code and the data, keeping snapshots of each as they evolve so that researchers can efficiently collaborate and reproduce experiments. “We need to be able to say on Thursday, we ran version X of the code on version Y of the data, and this is the result. Because that’s the hallmark of all science: reproducibility.”
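To make the idea of running "version X of the code on version Y of the data" concrete, here is a minimal Python sketch of a run record. It is only an illustration of the bookkeeping principle, not a description of how Firehose is implemented; the names `fingerprint`, `RunRecord`, `run_and_record`, and `count_samples` are hypothetical, and content hashing is just one assumed way to label a snapshot.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Return a short, stable label for a snapshot of code or data."""
    return hashlib.sha256(content).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """Ties a result to the exact code and data snapshots that produced it."""
    code_version: str
    data_version: str
    result: str


def run_and_record(code: bytes, data: bytes,
                   analysis: Callable[[bytes], str]) -> RunRecord:
    # Label both snapshots first, then run the analysis on the data.
    return RunRecord(fingerprint(code), fingerprint(data), analysis(data))


def count_samples(data: bytes) -> str:
    """Hypothetical stand-in for a real analysis: count one sample per line."""
    return str(len(data.splitlines()))


# Hypothetical snapshots of an analysis script and a data set.
code_v1 = b"def count_samples(data): ..."
data_v1 = b"sample_a\nsample_b\n"

record = run_and_record(code_v1, data_v1, count_samples)
rerun = run_and_record(code_v1, data_v1, count_samples)
updated = run_and_record(code_v1, data_v1 + b"sample_c\n", count_samples)
```

In this sketch, rerunning the same snapshots yields an identical record, while adding a sample changes the data label and leaves the code label untouched. That is the property that lets a collaborator say which result came from which combination.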

Firehose incorporates another Broad-built software package, GenePattern, which shuttles data through analytical modules. Firehose tells GenePattern which analytical codes to run on which set of data, serving as a bookkeeping system for both. As Mike explains, “Firehose drives GenePattern.” But it’s more complex than that. “Really, the Firehose pipeline is a metapipeline of pipelines.” Some of the modules within the system, such as GISTIC, which identifies regions of the genome that are recurrently amplified or deleted and so may harbor genes that drive cancer, are themselves pipelines that have taken years to develop and, in some cases, are still being developed. “You can imagine this is a really complicated problem to solve,” Mike says. “The codes themselves are evolving underneath your feet as you’re trying to run it.” Software engineers on Mike’s team work to keep the pipeline stable while computational biologists tinker with analysis codes.
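The "metapipeline of pipelines" idea can be sketched in a few lines of Python: if a pipeline is itself callable like any other step, it can be dropped into a larger pipeline. This is a toy illustration under that assumption, not the Firehose or GenePattern design; the `Pipeline` class, the `trace` helper, and the step names are all hypothetical.

```python
from typing import Callable, Sequence

# A step takes the working data and returns an updated copy.
Step = Callable[[dict], dict]


class Pipeline:
    """A named sequence of steps that can itself be used as a step."""

    def __init__(self, name: str, steps: Sequence[Step]):
        self.name = name
        self.steps = list(steps)

    def __call__(self, data: dict) -> dict:
        # Run each step in order, feeding its output to the next one.
        for step in self.steps:
            data = step(data)
        return data


def trace(label: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(data: dict) -> dict:
        return {**data, "history": data.get("history", []) + [label]}
    return step


# A hypothetical inner pipeline, standing in for a multi-stage module.
copy_number = Pipeline("copy_number", [trace("segment"), trace("score")])

# The outer pipeline treats the inner one as a single step.
meta = Pipeline("meta", [trace("load"), copy_number, trace("report")])

result = meta({})
```

Here `result["history"]` lists the inner pipeline's stages in line with the outer ones, which shows how a module that is itself a pipeline slots into the larger run.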

Firehose went live near the end of the ovarian project and has been operational for less than a year. Before then, TCGA scientists analyzed their data in a more ad hoc manner, storing it in local files and relying on their own notes to keep track of data and code versions. Scientific results that took two to three years to discover and iteratively refine for the ovarian work can now be replicated within two to three days through Firehose, greatly accelerating the pace of research for the TCGA’s full phase, which targets more than a dozen cancer types. “Firehose has really evolved its role in the TCGA,” Mike says. “It’s become something people are starting to really rely on, because it does this stuff reasonably well.”

For Mike and others on the TCGA team, the monumental effort serves a noble goal: to better understand cancer and pave the way for new treatments. “Hopefully the point of all this is you take raw data and turn it into discovery,” he says.

Other members of the Broad community contributing to the work include Lauren Ambrogio, Julie Ann, Kristin Ardlie, Will Armstead, Jennifer Baldwin, Toby Bloom, Peter Carr, Scott Carter, Sara Chauvin, Andrew Cherniack, Kristian Cibulskis, Cassandra Crawford, Andrew Crenshaw, Daniel DiCara, Sheli Dookran, Tim Fennell, Sheila Fisher, Nils Gehlenborg, Stephanie Grandbois, Supriya Gupta, D. Hubbard, Marcin Imielinski, Rui Jing, Sharon Kim, Mike Lawrence, Zach Leber, Semin Lee, Pei Lin, Spring Liu, Yingchun Liu, Susan McDonough, Aaron McKenna, Craig Mermel, Jill Mesirov, Paula Morais, Marc-Danie Nazaire, Huy Nguyen, Nathaniel Novod, Rob Onofrio, Peter Park, Richard Park, Mark Puppo, Alexis Ramos, Michael Reich, Jim Robinson, Keenan Ross, Gordon Saksena, Erica Shefler, Sachet Shukla, Raktim Sinha, Peter Stojanov, Andrey Sivachenko, Carrie Sougnez, Alvin Tam, Kristin Thompson, Helga Thorvaldsdottir, Roel Verhaak, Cherelle Walls, John Walsh, Jane Wilkinson, Wendy Winckler, CJ Wu, Terrance Wu, YongHong Xiao, Hailei Zhang, Jianhua Zhang, Juinhua Zhang, Lihua Zhou, Andrew Zimmer, and Robert Zupko. In addition to the Cancer Program, members of the Biological Samples Platform, Genome Sequencing Platform, and Genetic Analysis Platform contributed greatly to the success of this work.

Read more about the team behind the Broad's TCGA effort in part 1 of this series.

Read more about the discoveries made in the TCGA’s pilot project on ovarian cancer.