GenePattern: Data choreographer

Leah Eisenstadt, May 11th, 2011
  • This heatmap shows expression, or activity, levels of genes (vertical
    axis) in various cell types (horizontal axis). The expression level of each
    gene is averaged across cell types; red boxes signal expression levels
    above the average for that gene, and blue boxes signal
    below-average levels.

Six years ago, a team at the Broad faced a challenge: researchers the world over were using microarrays, chips covered with microscopic fragments of DNA, to measure the expression, or activity, of genes, but the tools to analyze the data from these large-scale studies were often out of the direct reach of biomedical researchers. Their answer to this challenge, GenePattern, has since evolved into a multipurpose software package used by researchers around the globe in diverse scientific fields.

When designing their solution, the team of software engineers led by Michael Reich, Director of Cancer Program Informatics Development at the Broad, was thinking ahead. Dozens of new tools were being released every month, but each one had its own way of working with data. Because there was no standard way to make tools interact, researchers were slow to adopt new methods developed by their computational colleagues. And if investigators wanted to reproduce their results, they had to rely on recording their inputs and outputs on paper. “The tools of the time were producing new insights and generating new hypotheses,” Michael says, “but they couldn’t keep track of what you did.” As a result, scientists who wanted to recreate an experiment or compare data across studies had difficulty keeping their analyses consistent. In response, Michael’s team created GenePattern, a suite of a dozen analytical tools that let researchers build “pipelines” for their multi-step analyses and that kept a history of user activity so an analysis could be easily reproduced.

As technology has progressed, the need for more robust, flexible, and diverse capabilities has grown with it. Today, GenePattern includes more than 150 analytical and visualization modules, not just for gene expression data but also for many other kinds of biological data, such as next-generation DNA sequencing, proteomics, flow cytometry, and DNA sequence variation, including single-letter changes and extra or missing sections of DNA. The software, which is freely available to the scientific community, has more than 15,000 users worldwide, including many at the Broad Institute.

While GenePattern was initially designed to analyze gene expression data, its ability to shuttle data through individual analysis steps in a pipeline gives it utility in fields beyond genetics. In a sense, Michael explains, GenePattern is independent of the modules it’s running. It allows users to insert their own modules into the framework and create analysis pipelines. He likens GenePattern to a choreographer, directing the flow of data between modules. As team member Helga Thorvaldsdottir explains, “You can take the framework that choreographs all the analyses, plug in your own modules, and not even use the analyses we provide.”
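For readers who think in code, the choreographer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GenePattern's actual interface: the `Pipeline` class, its `add` and `run` methods, and the two toy modules `center` and `label` are hypothetical names invented here. The sketch shows a framework that knows nothing about the modules it runs, passes each module's output to the next, and keeps a history of every step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Hypothetical names throughout; this is NOT GenePattern's real API.
# A "module" here is simply any function that takes data and returns data.
Module = Callable[[Any], Any]


@dataclass
class Pipeline:
    """Runs user-supplied modules in order and records what was run."""
    steps: List[Tuple[str, Module]] = field(default_factory=list)
    history: List[Tuple[str, Any, Any]] = field(default_factory=list)

    def add(self, name: str, module: Module) -> "Pipeline":
        # The framework does not care what the module computes.
        self.steps.append((name, module))
        return self

    def run(self, data: Any) -> Any:
        for name, module in self.steps:
            result = module(data)
            # Record step name, input, and output so the run can be retraced.
            self.history.append((name, data, result))
            data = result
        return data


# Two toy modules, loosely echoing the heatmap idea of comparing
# each value with the average.
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def label(values):
    return ["above" if v > 0 else "below" for v in values]


pipeline = Pipeline().add("center", center).add("label", label)
result = pipeline.run([1.0, 2.0, 6.0])
```

Swapping `center` and `label` for entirely different functions, say a chemistry simulation, would require no change to `Pipeline` itself, which is the sense in which the framework is independent of its modules.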

That’s what happened when an MIT professor teaching a graduate quantum chemistry course discovered GenePattern. In his class, instructors had been spending much of the course teaching FORTRAN and the Unix shell, two tools with very steep learning curves, so that students could run the simulation exercises in their problem sets. By wrapping the chemistry analyses in the GenePattern framework, the instructors freed students to focus on the subject matter itself.

In addition to Helga and Michael, the core GenePattern team consists of Peter Carr, Barbara Hill, DK Jang, Ted Liefeld, Marc-Danie Nazaire, Jared Nedzel, and Thorin Tabor. These developers work with computational biologists and end users at the Broad and other collaborating organizations to make sure that GenePattern can support the rapidly changing needs of genomic researchers. The team has recently announced support for next-generation sequencing data, and will soon support running GenePattern in a cloud computing environment, in which data can be stored and analyzed on remote servers. “There is tremendous interest among research organizations and bioinformatics cores in seeing how the cloud can help them to manage their ever-increasing burden of data,” says Michael. With such a widely used tool that keeps expanding its repertoire of functions, the GenePattern team will have little trouble keeping busy.