I’ll take machine learning for $1,000, Alex

Haley Bridger, March 1st, 2011
Computational biologist David Logan uses CellProfiler to analyze a multitude of cells within minutes. Photo by Len Rubenstein

Millions of viewers tuned in to Jeopardy to see IBM’s artificial intelligence software “Watson” compete against two of Jeopardy’s most celebrated champions. Despite a few amusing gaffes (for a Final Jeopardy clue about U.S. cities, Watson answered, “What is Toronto?????”), Watson beat its opponents handily. The three-night event was an opportunity for IBM to showcase a real marvel in machine learning – Watson is not only capable of processing natural language (the ordinary language we humans use in everyday conversation, as well as the pun-filled, crossword-puzzle-like clues of Jeopardy), but is also able to learn from its mistakes and adjust its future answers accordingly.

Watson, however, is not alone in these abilities. Researchers the world over are developing software and algorithms that can sift through human language, learn from errors and improve performance, or help humans home in on answers buried in vast amounts of data. At the Broad Institute, two groups have been developing programs that may be Watson’s cousins: GRAIL and CellProfiler Analyst. You won’t see them on Jeopardy any time soon, but their names have popped up in a number of journal articles recently, and they are helping researchers at the Broad and beyond make new discoveries.

GRAIL: Searching for gene relationships

Hundreds of thousands of journal articles about disease genes have been published over the last few decades. Sifting through this vast array of text to find genes that may play a critical role in a particular disease would be a daunting task for a human, but GRAIL (Gene Relationships Across Implicated Loci), a computer program created at the Broad, is up to the task.

Last year, Broad researchers studying rheumatoid arthritis (RA) used GRAIL to scan 250,000 published abstracts, or summaries, from papers written over the last four decades (read more here). GRAIL mined the academic literature, helping the researchers compile a list of promising candidate genes to focus their efforts on and connect those genes to pathways or mechanisms that may help explain the cause of RA.

Soumya Raychaudhuri developed the GRAIL program while he was a postdoctoral researcher at the Broad Institute. "One of the big challenges for technologies like GRAIL is to get [them] to move beyond the abstracts and get to the full text. That’s one of my long-term goals," he told us at the time of the RA study. "But the hope of all of this is that it will help us figure out what the key pathways are that seem to predispose people to disease. And hopefully with that, we’ll be able to identify drugs that are effective."

You can find out more about GRAIL here.

CellProfiler Analyst Classifier: Training a machine

At the Broad Institute, researchers also rely on programs that can interpret images instead of language. The Imaging Platform develops and uses open-source software called CellProfiler Analyst to look through images of cells that have been treated with chemical compounds or genetic perturbations, some of which may stop cells from growing, change their shape or behavior, or elicit another type of visible response. CellProfiler can measure a cell’s size and shape, as well as the intensity or texture of a stain (used to distinguish different compartments within a cell). Recognizing peculiar-looking cells, however, is a complex problem – a biologist can look at a cell and immediately determine that it looks either normal or abnormal, but a computer program has to rely on a combination of measurable features in order to assess a cell, and it may not immediately be able to guess correctly. Once the program is “taught” what to look for, though, it can process thousands of images quickly and accurately.

Broad researchers have developed a way to train the program and improve its abilities. In the beginning, the computer program and the biologist work together, with the biologist “teaching” the program which cells are noteworthy. The biologist shows the computer examples of cells with a particular appearance that the biologist finds interesting for research. After some training, the program shows the biologist more cells that it judges to be like the examples it has just been shown. The biologist corrects any errors, and the cycle repeats until the computer “learns” what the biologist is looking for. Once the program has been trained, it can score billions of cells from an experiment in a few minutes, identifying cells of interest. This points the biologists’ attention directly to the most promising samples, which might lead to uncovering the causes of, and cures for, a variety of diseases.
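For readers who like to see the idea in code, the teach-and-correct cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CellProfiler Analyst's actual implementation: the nearest-centroid classifier is a deliberately simple stand-in for whatever learning method the real software uses, the two features (cell size and stain intensity) and their value ranges are made up, and the `biologist` function is a hypothetical substitute for the human expert.

```python
import random

def centroid(rows):
    """Mean of each feature across a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(labeled):
    """Fit a nearest-centroid model: one mean feature vector per label."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Return the label whose centroid lies closest to the given features."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist2(model[label]))

def biologist(features):
    """Hypothetical stand-in for the expert: large cells are 'interesting'."""
    size, _intensity = features
    return "interesting" if size > 50 else "ordinary"

def interactive_training(cells, seed_examples, rounds=5, batch=20, rng=None):
    """Repeat: program guesses, expert confirms or corrects, program retrains."""
    rng = rng or random.Random(0)
    labeled = list(seed_examples)
    model = train(labeled)
    corrections = 0
    for _ in range(rounds):
        for features in rng.sample(cells, batch):
            guess = predict(model, features)   # program proposes a label
            truth = biologist(features)        # expert supplies the right one
            corrections += guess != truth
            labeled.append((features, truth))
        model = train(labeled)                 # retrain on the corrected set
    return model, corrections

def accuracy(model, cells):
    """Fraction of cells on which the model agrees with the expert."""
    return sum(predict(model, c) == biologist(c) for c in cells) / len(cells)

# Made-up data: (size, stain intensity) for 1,000 simulated cells.
data_rng = random.Random(1)
cells = [(data_rng.uniform(10, 100), data_rng.uniform(0, 1)) for _ in range(1000)]

# Two deliberately unrepresentative starting examples from the expert.
seed_examples = [((95.0, 0.5), "interesting"), ((45.0, 0.5), "ordinary")]

before = accuracy(train(seed_examples), cells)
model, corrections = interactive_training(cells, seed_examples)
after = accuracy(model, cells)
print(f"agreement with expert: {before:.2f} before, {after:.2f} after")
```

The point of the sketch is the loop, not the classifier: each round of corrections gives the program a more representative set of examples, so its guesses move closer to the expert's judgment, and the trained model can then be applied to every cell in an experiment without further human input.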

“[CellProfiler Analyst] is used at the Broad and around the world primarily to sift through tens or hundreds of thousands of chemical or gene perturbants to identify those with relevant effects on cell populations,” says Anne Carpenter, director of the Imaging Platform. “Plus, it’s wicked fun software to use.”

You can find out more about the Imaging Platform and its open-source software tools here.