A partnership in the name of reproducible research
In 1610, Galileo Galilei set a scientific precedent for the next half millennium: he published his notebooks. Sidereus Nuncius, as the publication was officially dubbed, documented Galilei’s observation of several astronomical features, including the orbit of Jupiter’s moons around the planet and the play of light and shadow on the earth’s moon. These various observations led Galilei to his then-radical conclusion that the earth orbits the sun instead of vice versa.
By transparently publishing his entire body of notes, raw and undecorated, Galilei ensured the credibility of his work. His pages consisted of images, text, and even a sort of rudimentary “code,” where circles represented planets and asterisks stars. Anyone who wanted to reproduce his efforts could easily do so—even now.
Fifty years later, Robert Boyle set another precedent when he established reproducibility as a cornerstone of the scientific method. But today this indispensable element of credible science is falling short. A series of recent studies has suggested that a majority of published findings in the biological sciences cannot be reproduced.
“Right now there are more data and more tools,” said Michael Reich, director of informatics development for the Broad Institute’s Cancer Program. “So you’re multiplying combinatorially the number of connections between data and tools, increasing the chance—unless you do something purposeful to prevent it—that reproducibility is going to be a challenge.”
That purposeful prevention step is what interests Reich’s team, which sits in the lab of Jill Mesirov, the Broad’s director of Computational Biology and Bioinformatics. A new grant to integrate two open-source tools, GenePattern and IPython Notebook, is the latest move in that effort. GenePattern was developed at the Broad, while IPython Notebook originated at the University of California at Berkeley and California Polytechnic State University, San Luis Obispo, and is now developed as a broad academic and industrial open-source project.
GenePattern is a program designed to both analyze data and record one’s methods. It wraps many genomic analysis tools in straightforward user interfaces that require no computational experience to operate. IPython (which will soon change its name to Jupyter in a nod to both Galilei and the various programming languages it supports) is a sort of modern-day version of Sidereus Nuncius. It allows computational scientists to combine code, text, mathematics, plots, and rich media into a single document.
The current standard is to include some of that content in the supplementary information offered with a research publication. “But it’s still not enough for someone else to reproduce what you did,” said Reich. “Even for people with computational experience, it takes a substantial amount of work to set up a reproducible analysis environment.”
IPython Notebook eases that workload. “It interleaves the discussion of a scientific experiment with the actual in silico experiment itself,” Reich said. And in this age of questionable reproducibility, the fact that IPython Notebooks are interactive is critical. “Readers” can download a version of the Notebook and use, manipulate, and expand on its code and data themselves. Soon, the platform will also allow users to work collaboratively on Notebooks, Reich said.
But in its current form, interacting with IPython Notebook still requires considerable computational savvy. Integrating GenePattern will make the Notebook accessible to the entire genomics research community, regardless of a user’s level of programming experience.
In this way, Reich said, “we’re bridging the gap between the biologist and the tools required to reproducibly answer biological questions without many of the impediments that currently hinder the pace of genomic research.”
Developers in the Cancer Program are creating a host of tools that cover everything from new methods for genomic analysis to new ways of accessing and implementing those methods. As Reich put it, the team wants to do for bioinformatics analyses what the internet did for travel reservations; enabling technologies make things that used to be difficult—like buying airplane tickets—very easy, he said.
GenePattern allows users to drop in their data, and then select from a host of analytical tools to make sense of them—no coding necessary. Once GenePattern is integrated into IPython Notebook, users will do the same within a Notebook “cell” instead of writing code in that cell. A completed Notebook may have cells containing everything from GenePattern outputs to descriptive text to explanatory charts and images.
Because of its recording capabilities, Reich said, even those with significant programming expertise can find GenePattern—and its integration into IPython Notebook—useful. “Most analysis tools are designed to take some input, process it, and give you results,” he explained. “Remembering what was done is not their responsibility.” GenePattern tracks not just the inputs and outputs, but also the parameters and software versions that generate those outputs, providing an environment that remembers the details so the scientist doesn’t have to. These features will be ideal complements to the lab notebook-style presentation that IPython Notebook provides.
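The kind of bookkeeping described above can be illustrated with a minimal Python sketch. It is not GenePattern’s actual implementation or data model: the `AnalysisRecord` class, the `run_recorded` helper, and the toy `uppercase` tool are all hypothetical names invented for this example. The sketch assumes only that an analysis step can be treated as a function of its input data and parameters.

```python
import hashlib
import json
import platform
from dataclasses import dataclass, field, asdict


@dataclass
class AnalysisRecord:
    """Hypothetical provenance record (not GenePattern's real data model)."""
    tool: str
    tool_version: str
    parameters: dict
    input_digest: str
    output_digest: str
    # Record the interpreter version as a stand-in for "software versions".
    python_version: str = field(default_factory=platform.python_version)


def digest(data: bytes) -> str:
    """Fingerprint data so a later rerun can be checked against it."""
    return hashlib.sha256(data).hexdigest()


def run_recorded(tool, tool_version, func, data: bytes, **parameters):
    """Run func(data, **parameters) and return (output, record)."""
    output = func(data, **parameters)
    record = AnalysisRecord(
        tool=tool,
        tool_version=tool_version,
        parameters=parameters,
        input_digest=digest(data),
        output_digest=digest(output),
    )
    return output, record


def uppercase(data: bytes, repeat: int = 1) -> bytes:
    """Toy stand-in for a real analysis tool."""
    return data.upper() * repeat


output, record = run_recorded("uppercase", "0.1", uppercase, b"acgt", repeat=2)
print(json.dumps(asdict(record), indent=2))
```

Because the record stores digests of the input and output along with the parameters and versions, someone rerunning the step later can confirm that they obtained the same result from the same starting point.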
Tools like GenePattern free biologists to focus on using computational tools while others, like Reich and his colleagues, work to develop them.
And while IPython Notebook might not be available 400 years from now, it at least allows a scientist’s contemporaries, regardless of computational skill level, to reproduce his or her work, just as Sidereus Nuncius did for Galilei’s.