Letting the knowledge flow: Firehose, FireBrowse & FireCloud
It's been said that getting an education from MIT is like taking a drink from a fire hose. At the Broad Institute of MIT and Harvard a similar ethos prevails, and is particularly evident in the aptly named cancer analysis pipeline, Firehose. Conceived by Broad Institute member Gad Getz’s Cancer Genome Analysis group and funded by The Cancer Genome Atlas (TCGA) program of the National Cancer Institute and the National Human Genome Research Institute, Firehose was developed through a fruitful partnership of software engineers and computational biologists. Firehose culls and analyzes massive quantities of data, feeding the global scientific community genetic information in an effort to systematize cancer research.
In the genetic lexicon, the four letters “A,” “T,” “C,” and “G” correspond to the nucleotide bases that compose DNA, which are arranged into “sentences” (also known as genes), and translated into terminology the body can understand, helping to dictate traits from eye color to disease susceptibility. However, molecular spelling errors can prompt uncontrollable cell growth and contribute to cancer. During TCGA’s pilot effort to identify those spelling errors in a variety of tumors, circa 2007 to 2010, it became evident that traditional approaches to biological research, where computer code and data might only exist on the laptop of a single scientist, were no longer feasible. Firehose emerged as a means to channel the results amassed during this project, expanding the endeavor from hundreds to thousands of samples and incorporating cancer types beyond ovarian, lung, and breast cancer.
In addition to sequencing and characterization centers, TCGA also funded seven Genome Data Analysis Centers (GDACs) including Getz’s GDAC, which continues to operate the Cancer Firehose pipeline. Today, Firehose contains perhaps the most well-categorized, integrated cancer dataset in the world, spanning more than 11,000 patients and 38 different types of cancer, in the form of roughly 80,000 sample aliquots.
Michael Noble, assistant director of data science for the Getz lab, leads the Firehose initiative at the Broad on behalf of TCGA. Given hose-related parlance, it’s appropriate that Noble uses liquid terms to describe the recent leap from “wet,” lab bench approaches to algorithmic ones, culminating in a true “watershed” moment for biomedical research. “The disparity of power between these two modes of inquiry is dramatic,” he explained. “It’s a privilege to witness the transformation of a scientific discipline from largely qualitative to digitized and quantitative.”
Though geneticists generally share a common scientific language that facilitates collaboration and the exchange of knowledge, differences persist in modes of analysis and sample number. Noble compares one such disparity to the “Babel problem,” a nod to the Book of Genesis, in which those constructing an immense tower are prevented from further progress when divine intervention introduces multiple dialects and fractures the once unilingual community. During large and geographically diverse projects like TCGA, researchers might find contrasting trends in the data due to different sample numbers or subtle discrepancies in the “tuning” of their algorithms. As a result, Noble and others seek to translate varied datasets from 150 tissue donor sites and 20 TCGA centers into a universal language of zeros and ones. “You can’t make a claim about whether or not mutations lead to a certain outcome, for example, if you don’t have concordance between your sample counts,” he contended.
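To make the idea of sample concordance concrete, here is a minimal sketch in Python. It assumes that each analysis center reports a plain list of sample identifiers; the function name, the center names, and the identifiers are all made up for illustration and do not reflect actual TCGA data or Firehose code.

```python
def concordance(samples_a, samples_b):
    """Split two collections of sample IDs into shared and center-specific sets."""
    a, b = set(samples_a), set(samples_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up sample identifiers (assumption: not real TCGA barcodes).
center_a = ["S001", "S002", "S003", "S004"]
center_b = ["S002", "S003", "S004", "S005"]

report = concordance(center_a, center_b)
print("shared:", sorted(report["shared"]))
print("only in A:", sorted(report["only_a"]))
print("only in B:", sorted(report["only_b"]))
```

Any samples that land in the center-specific sets would need to be reconciled, or excluded, before results from the two centers could be compared on equal footing.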
Using this stored data, Firehose executes a complex series of interdependent workflows, or runs, which include numerous computational algorithms such as GISTIC and MutSig. The results from each run are packaged into summary “Nozzle reports,” resembling condensed scientific papers and effectively streamlining the slew of numbers into a format easily understood by a wide audience.
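One way to picture a run of interdependent workflows is as a dependency graph in which each step starts only after the steps it relies on have finished. The Python sketch below uses the standard library's graphlib module to compute such an order. The workflow names and their dependencies are hypothetical and chosen only for illustration; they are not Firehose's actual task graph or scheduling code.

```python
from graphlib import TopologicalSorter

# Hypothetical workflows mapped to the workflows they depend on
# (assumption: illustrative names, not Firehose's real pipeline).
dependencies = {
    "load_samples": set(),
    "copy_number_analysis": {"load_samples"},
    "mutation_analysis": {"load_samples"},
    "summary_report": {"copy_number_analysis", "mutation_analysis"},
}


def run_order(deps):
    """Return one valid order in which every workflow follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

In this toy graph, the two analyses depend only on the loaded samples and so may run in either order, while the summary report always comes last.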
Firehose also permits customization, allowing users to upload data and request analyses between scheduled monthly runs. “If you think of our analysis workflows as pipelines,” Noble explained, “this is a way to hook up a new pipe at a certain point in that workflow to get the latest available samples.” The Firehose database is powerful and substantive, but not what Noble would deem “pretty.” He said, “It took time and resources just to keep the beast running, so we didn’t have much of a chance to put a ribbon on it.”
Enter FireBrowse, a free, “browse-able” interface to Firehose geared more toward the biologically minded consumer than the computationally minded one. Noble maintained, “FireBrowse gives you a compass to navigate the piles of data we produce with Firehose, as well as a tool to sculpt that data and ferret out the items of greatest interest.” Thanks to FireBrowse’s accessibility, principal investigators and academic departments need not hire additional staff just to begin sifting through TCGA results; a few clicks of the mouse are all it takes to transfer a concise data package to your analytic software of choice and systematically identify trends. “Clinicians can go to our site or other portals that use our results, and in a couple clicks gain a statistical sense of the potential outcomes for a patient sitting before them with a cancerous gene alteration,” Noble said.
But iCoMut remains FireBrowse’s true pièce de résistance. The “CoMut” stands for “mutation co-occurrence,” although the preceding “i” (signifying “interactive”) may represent the more telling portion of the name. iCoMut refines the 1,500 analyses generated via Firehose, enabling researchers to manipulate datasets in real time. Applied around the world, such features continue to advance the assault on cancer.
The Firehose/FireBrowse user demographic spans the academic, research, and commercial communities. Commercial users, for instance, might employ FireBrowse to expedite data collection and mining. “With this tool, pharmaceutical companies can readily exploit TCGA advances while pursuing new compounds,” Noble asserted. “Although the results of an automatic processing system like Firehose will never be perfect, when performed upon every TCGA sample and made easily available they can be quite compelling.”
Given the sheer size of the human genome (roughly three billion base pairs), gone are the days of isolated experimentation and manual data assessment. “I’d like to think that Firehose and FireBrowse hasten the pace of discovery,” Noble noted, “and help disarm the longstanding stereotype that research software and processes lack the rigor, throughput, and flexibility of their commercial counterparts.”
The next phase of Firehose will let users modify data analyses themselves and will involve relocating Firehose to the “cloud” as part of an NCI Cancer Genomics Cloud Pilot project, developed together with the Data Science and Data Engineering team at the Broad. True to form, this revision (to be completed by early 2016) has been dubbed “FireCloud.” As the pace of information dissemination accelerates, when it rains it pours. Noble and colleagues forecast heavy downpours to come, leading to fair skies for physicians, researchers, and patients alike.
Mermel, CH et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology. 28 April 2011. DOI: 10.1186/gb-2011-12-4-r41.
Beroukhim, R et al. Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. PNAS. 11 December 2007. DOI: 10.1073/pnas.0710052104.
Lawrence, MS et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. Online 16 June 2013. DOI: 10.1038/nature12213.