Behind the scenes of The Cancer Genome Atlas: part 1

The TCGA team drinks from the firehose of large-scale cancer sequencing

In this two-part series, we’ll give you a look at some of the tools, teams, and resources built by Broad Institute scientists to support the large-scale cancer sequencing project known as The Cancer Genome Atlas (TCGA).

Two years ago, members of the TCGA team faced a daunting yet exciting challenge. The team, led in part by Broad Institute scientists, had recently completed the first project of the TCGA’s pilot phase, in which they sequenced 600 genes in 200 samples of glioblastoma, a deadly brain cancer, and revealed new mutations and core pathways involved in the disease. While planning the project’s second pilot phase, aimed at ovarian cancer, the team realized that simply scaling up that approach, even tenfold, would fall short of the project’s ambitious goals.

“TCGA had this mission to create the most comprehensive dataset of cancer samples, and it was recognized at that point that six hundred genes wasn’t enough to really say we’ve sampled the genomic landscape of ovarian cancer,” says Carrie Sougnez, project manager for cancer genome projects at the Broad and a coordinator of the TCGA. Fortunately, “next-generation” sequencing technology was advancing quickly and made it possible to rapidly and cheaply generate greater volumes of data. These new capabilities allowed the researchers to broaden their scope and sequence not just a few hundred genes, but thousands of genes or even all the protein-coding segments of the ovarian cancer genome, known as the exome, in hundreds of tumor samples.

The opportunity to study the entire exome in so many cancer samples was unprecedented and presented the team with plenty of challenges, from laboratory techniques to analytical methods. “It was one of my favorite projects I’ve worked on,” Carrie says. “We didn’t know how to do something, and it was a high priority to make it work while at the same time processing real samples. We were very invested in it, with a lot of resources and a lot of smart people working on it.”

To get exome sequencing off the ground, scientists at the Broad assembled an interdisciplinary “exome taskforce” of experts in analysis, informatics, sequencing technology, and molecular biology. The taskforce met every two weeks to review recent data and improve the technique.

To sequence only the exome of a genome, scientists must first capture the exons with a technique called hybrid selection. Initially, the Broad Illumina sequencing team, led by Sheila Fisher, assistant director of Illumina technology development for the institute’s Genome Sequencing Platform, was tasked with scaling up the approach and making exome sequencing a robust and reliable production activity. “The process is complicated and involved and at the time no other center was trying to do this at the same scale we were. Initially the technology was inefficient at targeting the genes and we got too much of the ‘background’ genome we didn’t want,” says Carrie, “but the exome taskforce and the lab teams backing it up got that under control by carefully reviewing the ongoing development exome data and together deciding what process improvements might help.”

The team also made measurements that helped them assess and improve quality in the samples or the methods – even the temperature of water baths in the laboratory influenced the sequencing efficiency. “That was really a testament to the troubleshooting ability of [Sheila’s sequencing team] to figure that out,” says Carrie. “It’s much more robust now – thousands of samples have been and are continuing to be processed with amazingly consistent results.”

Project managers Lauren Ambrogio and Sara Chauvin coordinated the transfer of samples from the Biological Samples Platform to the sequencing platform and ensured that the samples were properly sequenced. During the ovarian cancer project, software engineer Tim Fennell developed the Picard pipeline, a software tool that helps align reads from the sequencing machines to the reference genome and turn them into standard files that can be shared with the TCGA community, a network of one program office and 19 research centers, including the Broad. Carrie says Tim’s work was extremely valuable, in part because it included several quality checks that helped the team identify and address problems in the data.

The Broad also contributed data on copy number changes and expression levels in the cancer samples, an effort led by Rob Onofrio, a project manager in the Broad’s Genetic Analysis Platform.

The ovarian cancer project was one of the Broad’s first large-scale next-generation sequencing efforts in cancer. Analyzing this new resource of data became a huge task for the Broad’s computational biologists, led by Gaddy Getz, director of Cancer Genome Computational Analysis. To identify mutations in the sequencing data, computational scientists developed new analytical tools including MuTect, which identifies somatic mutations, written by Kris Cibulskis; MutSig, a tool written by Mike Lawrence that identifies significantly mutated genes; dRanger, a rearrangement detector, also written by Mike; and Indelocator, a tool that looks for somatic insertions or deletions, written by Andrey Sivachenko.

Tomorrow we’ll give you part two of this series: an inside look at the analysis management system, Firehose, built by Broad scientists to handle the flood of data and analytical code from the TCGA.

Other members of the Broad community contributing to the work include Julie Ann, Kristin Ardlie, Will Armstead, Jennifer Baldwin, Toby Bloom, Peter Carr, Scott Carter, Andrew Cherniack, Lynda Chin, Cassandra Crawford, Andrew Crenshaw, Daniel DiCara, Sheli Dookran, Nils Gehlenborg, Stephanie Grandbois, Supriya Gupta, D. Hubbard, Marcin Imielinski, Rui Jing, Sharon Kim, Zach Leber, Semin Lee, Pei Lin, Spring Liu, Yingchun Liu, Susan McDonough, Aaron McKenna, Craig Mermel, Jill Mesirov, Paula Morais, Marc-Danie Nazaire, Huy Nguyen, Michael Noble, Nathaniel Novod, Peter Park, Richard Park, Mark Puppo, Alexis Ramos, Michael Reich, Jim Robinson, Keenan Ross, Gordon Saksena, Erica Shefler, Sachet Shukla, Raktim Sinha, Peter Stojanov, Alvin Tam, Kristin Thompson, Helga Thorvaldsdottir, Roel Verhaak, Douglas Voet, Cherelle Walls, John Walsh, Jane Wilkinson, Wendy Winckler, CJ Wu, Terrance Wu, YongHong Xiao, Hailei Zhang, Jianhua Zhang, Juinhua Zhang, Lihua Zhou, Andrew Zimmer, and Robert Zupko. In addition to the Cancer Program, members of the Biological Samples Platform, Genome Sequencing Platform, and Genetic Analysis Platform contributed greatly to the success of this work.

Read more about the discoveries made in the TCGA’s pilot project on ovarian cancer.