What is Firehose?
Firehose is an analysis infrastructure developed at The Broad Institute to coordinate the flow of terabyte-scale datasets through dozens of quantitative algorithms.
Implemented primarily as a Java web application, Firehose has three chief aims: reproducibility, automation, and high throughput. To achieve the first of these, Firehose version-tracks samples, algorithms, and the logic that binds them together, storing this and related execution information in a persistent RDBMS for provenance. This makes it easy for users to identify, over arbitrarily long time spans, what has been run, what can be run, what needs to be run again if data or algorithms change, and what cannot be run because of unsatisfied dependencies.
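The four questions above (what has been run, what can be run, what must be rerun, what cannot be run) can be illustrated with a minimal Python sketch. This is not Firehose code: the names RunRecord and classify, the in-memory dictionaries standing in for the RDBMS, and the integer version numbers are all assumptions made for illustration. An unsatisfied dependency is modelled here simply as a sample or algorithm with no registered version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Versions that were in effect when a task last ran (hypothetical schema)."""
    sample_version: int
    algorithm_version: int


def classify(task, sample_versions, algorithm_versions, history):
    """Classify a (sample_id, algorithm_id) task against stored provenance.

    sample_versions / algorithm_versions map ids to their current version;
    history maps tasks to the RunRecord of their most recent execution.
    """
    sample_id, algorithm_id = task
    # A missing sample or algorithm stands in for an unsatisfied dependency.
    if sample_id not in sample_versions or algorithm_id not in algorithm_versions:
        return "blocked"
    current = RunRecord(sample_versions[sample_id], algorithm_versions[algorithm_id])
    previous = history.get(task)
    if previous is None:
        return "runnable"      # never run, but everything it needs exists
    if previous == current:
        return "up_to_date"    # already run with the current versions
    return "stale"             # data or algorithm changed since the last run
```

Keeping the versions inside the run record is what lets a change to either the data or the algorithm be detected by a plain equality check, without inspecting the outputs themselves.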
Automation and throughput are facilitated in several ways. First, while Firehose is most often used interactively from a browser, it also offers an extensive API for programmatic control of routine tasks and for scaling to multiple workspaces and datasets. Second, by encapsulating data and algorithm parameters within abstract annotations, instead of only literal values or explicit file system references, Firehose is able to execute analyses in a data-blind manner across a wide variety of inputs, without modification or onerous bookkeeping for end users. This has proven to be a powerful metaphor for interacting with TCGA data, for example, because once an algorithm is in Firehose it can run on either a single tumor type or all of them with equal ease. Third, encapsulating jobs within an abstract execution engine (presently GenePattern, but support for others is under development) enables them to be transparently dispatched to a single machine or across many compute nodes, again without algorithm modification or extensive user tuning. In-depth knowledge of the underlying operating system or HPC task scheduler is not required, as entire analysis workflows for all samples of interest can be executed at the click of a button or via one API call. Finally, the provenance stored for reproducibility also increases throughput by allowing new job requests that match previous requests to be skipped entirely; in these cases Firehose simply returns the prior results.
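Two of these ideas, binding parameters to annotations rather than literal paths and returning prior results for repeated requests, can be sketched in a few lines of Python. This is a hedged illustration only, not the Firehose API: resolve_parameters, job_key, ProvenanceCache, and the hashing scheme are hypothetical names and choices, and the in-memory dictionary stands in for the persistent provenance store.

```python
import hashlib
import json


def resolve_parameters(bindings, annotations):
    """Map each algorithm parameter to the value of the annotation it is bound to.

    bindings:    parameter name -> annotation name (the same for every entity)
    annotations: annotation name -> value for one particular entity
    """
    return {param: annotations[name] for param, name in bindings.items()}


def job_key(algorithm, version, resolved):
    """Build a stable fingerprint of a job request (hypothetical scheme)."""
    payload = json.dumps([algorithm, version, resolved], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ProvenanceCache:
    """Run a job only if an identical request has not been seen before."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # number of jobs actually executed

    def run(self, algorithm, version, func, resolved):
        key = job_key(algorithm, version, resolved)
        if key not in self._results:
            self.executions += 1
            self._results[key] = func(**resolved)
        return self._results[key]
```

Because the algorithm only ever sees resolved parameters, the same bindings can be applied to one entity or to many without modification. Including the algorithm version in the key means a new version is treated as a new request rather than being answered from earlier results.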
Although still evolving, Firehose has become a valuable piece of the Broad Institute's computing infrastructure. It is used daily by dozens of people in the Cancer Genome Analysis group to perform all TCGA GDAC and GSC analyses, managing hundreds of thousands of jobs on tens of thousands of samples, spread over hundreds of compute nodes and a 400+ TB file system.