Visualizing the cancer genome
Nico Stransky was getting frustrated. A computational biologist working in the Broad’s Cancer Program, Nico was trying to see patterns in the data from the recently sequenced genomes of 70 tumor samples from patients with head and neck cancer. In the study, scientists sequenced the exomes, or protein-coding portions, of the tumor genomes and analyzed the data to reveal mutations in a variety of forms that disrupt the “spelling” of genes in different ways. But the tables of mutation statistics that Nico was looking at could not tell him the full story. “Not only did I want to summarize these numbers, but I also wanted to go back to the data and see which patients had mutations in which genes,” Nico recalls. He needed a graphic that could depict individual patients’ mutations, rank these mutations by their significance, illustrate different kinds of mutations, and more. Since no such graphical representation existed, he decided to create one himself.
The result of Nico’s efforts is a new kind of figure, one that captures data on many types of mutations from the cancer genomes of dozens of patients and displays this information all in one place. (You can see a version of the figure in the recent Science paper on head and neck cancer here and read the related Broad press release here). Each column in the plot shows information for a single, anonymous patient’s tumor, and each row reveals a specific gene’s alterations in different patients. Nico calls the new plot a coMut map because it reveals the co-occurrence of mutations in samples (although some of his fellow researchers still affectionately refer to it as a “Nico-gram”).
You can see an example of the coMut map to the right (try clicking on the image to see the full-size version). When I first saw this graphic, I was struck by how hauntingly beautiful it is. A wave of green and blue crests along the top of it, and colored rectangles seem to rain down the figure’s center. But as Nico and I looked more closely at the graphic, he showed me just how much information the image conveyed.
By displaying information in the plot’s main body and along all four axes, Nico generated a rich graphic that can reveal insightful patterns in the data. Along the right axis of the map is a ranked list of the most commonly and significantly mutated genes in head and neck cancer, beginning with TP53, a crucial tumor suppressor gene that is turned off in many forms of cancer. Following TP53 in the top row across the map, you can see that it is mutated in a number of different ways, represented by differently colored rectangles, in 62 percent of the more than 70 cancer samples surveyed. It also has the greatest number of mutations (displayed along the left axis) – close to 50.
Along the graph’s top axis, Nico displays the density of mutations for each patient’s tumor. The patient at the far left of the graph has the highest mutation rate – more than 20 mutations for every one million base pairs. Following this single patient down the chart shows which particular genes in this patient’s cancer sample are mutated. The bottom panel of the chart groups these mutations into categories – red indicates a change from a “G” to a “T”; yellow indicates a change from a “C” to a “G” or an “A,” and so forth.
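For readers who like to see the bookkeeping behind a plot like this, here is a minimal sketch in Python (not Nico's R code, which is not shown in this post) of how a gene-by-patient grid could be assembled. The records, the placeholder names GENE_B and GENE_C, and the functions build_comut and render are all hypothetical. The sketch also simplifies in two ways: it ranks genes only by how many patients carry a mutation, whereas the real map also draws on significance scores from separate analytical tools, and it orders patients by raw mutation count rather than by mutations per million base pairs.

```python
from collections import defaultdict

# Made-up records for illustration only: (patient, gene, mutation type).
# GENE_B and GENE_C are placeholders, not genes from the study.
records = [
    ("P1", "TP53", "missense"),
    ("P1", "GENE_B", "nonsense"),
    ("P1", "GENE_C", "missense"),
    ("P2", "TP53", "nonsense"),
    ("P2", "GENE_B", "frameshift"),
    ("P3", "TP53", "missense"),
    ("P4", "GENE_C", "splice_site"),
]
patients = ["P1", "P2", "P3", "P4"]


def build_comut(records, patients):
    """Arrange mutation records into rows (genes) and columns (patients)."""
    grid = defaultdict(dict)                 # gene -> {patient: mutation type}
    per_patient = {p: 0 for p in patients}   # mutation count per patient
    for patient, gene, mut_type in records:
        grid[gene][patient] = mut_type       # a later record overwrites an earlier one
        per_patient[patient] += 1
    # Rows: genes mutated in the most patients come first (ties broken by name).
    genes = sorted(grid, key=lambda g: (-len(grid[g]), g))
    # Columns: patients with the most mutations come first (ties broken by ID).
    columns = sorted(patients, key=lambda p: (-per_patient[p], p))
    # Side panel: percentage of patients carrying a mutation in each gene.
    percent = {g: 100.0 * len(grid[g]) / len(patients) for g in genes}
    return grid, genes, columns, percent


def render(grid, genes, columns, percent):
    """Draw a plain-text version of the grid: '#' marks a mutated gene."""
    lines = []
    for g in genes:
        cells = "".join("#" if p in grid[g] else "." for p in columns)
        lines.append(f"{g:<8}{cells}  {percent[g]:.0f}%")
    return "\n".join(lines)


grid, genes, columns, percent = build_comut(records, patients)
print(render(grid, genes, columns, percent))
```

A real figure would color each cell by mutation type instead of printing a single symbol, which is why the grid keeps the type for every gene and patient pair.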
This bottom panel reflects the underlying trigger of many cases of head and neck cancer: tobacco. Tobacco use leaves a unique mutation signature in the tumor genome, rich in “G” to “T” mutations. If Nico and I were to look at this spectrum for another kind of cancer – say, skin cancer, which is inextricably linked to UV radiation from the sun – we would see a different signature (more “C” to “T” transitions).
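The idea of a mutation signature can be made concrete with a small tally. The following Python sketch, using invented counts and a hypothetical function named spectrum, computes what fraction of a tumor's single-base changes fall into each class; it is only an illustration of the concept, not the method used to build the map's bottom panel.

```python
from collections import Counter


def spectrum(substitutions):
    """Return the fraction of single-base changes in each ref>alt class."""
    counts = Counter(f"{ref}>{alt}" for ref, alt in substitutions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Invented example: a tumor whose changes are mostly G to T,
# the kind of imbalance the article associates with tobacco.
tumor = [("G", "T")] * 5 + [("C", "T")] * 3 + [("C", "G")] * 2

fractions = spectrum(tumor)
dominant = max(fractions, key=fractions.get)
print(dominant, fractions[dominant])
```

Comparing these fractions across tumors, or across cancer types, is what lets an unusual imbalance stand out.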
“This helps bring out some hypotheses about what is causing the mutations in a particular kind of cancer,” Nico explains. “Obviously, for lung cancer and melanoma, the carcinogen is known – tobacco and the sun, respectively – but one can imagine in another cancer, we might see a big imbalance between the kinds of mutations that occur. That could be due to a carcinogen that was not previously suspected.”
Nico’s map is devised as a visualization tool, not an analytical one. Other researchers at the Broad, led by Gaddy Getz, devote their time to analyzing the raw data that comes off the Broad’s sequencers (read more about this team and their tools in a BroadMinded blog entry here). Nico works closely with this team to put the final polish on these results. “The map heavily relies on all of these other, analytical tools,” says Nico. “But in a sense, it’s the most concrete thing that we can see at the end of this process. When you see a list of genes or a list of significance, it’s less concrete.”
For Nico, creating figures is a hobby – he enjoys thinking about the best ways to represent data, and often chats with Bang Wong, the Broad’s creative director, about the figures he is devising. Nico works (and plays) in a programming language called R, which can perform statistics but also make plots and figures. He likens programming to cooking with a recipe. “There are a few cooking secrets,” he says, “but you play around and see what can be improved – what you can add to the figure without making it too confusing.”