Power in numbers
Earlier this month, we reported on a powerful analysis of next-generation sequencing data from the pilot phase of the 1000 Genomes Project that resulted in a rich, high-resolution map of structural variation — extra, missing, or rearranged DNA — in the human genome. That map, along with research tools created from it, will enable new studies of this kind of DNA variation in human biology and disease.
Scientists on the 1000 Genomes Project’s structural variation team — co-led by Broad associate member Charles Lee — wrote 19 computer algorithms, or programs, to identify these DNA differences among large numbers of sequenced genomes (185 were sequenced in the 1000 Genomes pilot). The most successful algorithm, known as Genome STRiP (Genome STRucture in Populations), was developed by a team of Broad researchers led by Broad associate member Steve McCarroll and is the focus of a new paper in Nature Genetics this week.
As Steve explains, when genome centers began producing individual sequenced genomes, such as that of double-helix co-discoverer James Watson, scientists took a technical approach to uncovering structural variation by looking for signatures of deletions or other variants in each genome individually. Genome sequencers don’t read the order of DNA nucleotides, or “letters,” in one long string, but in millions of short fragments, aptly called “reads.” When the reads don’t line up well with the reference human genome, or occur more or fewer times than expected, they can signify that a chromosome may harbor a structural variant. For example, if a read seems to skip a section of the reference genome, it may signify that some of that chromosome is deleted.
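To make the read-count signature concrete, here is a minimal sketch in Python. It is an illustration only, not code from any of the 1000 Genomes algorithms: the function name, the window counts, and the 0.6 and 1.4 cutoffs are all hypothetical. It flags stretches of the reference genome where far fewer or far more reads landed than expected.

```python
def flag_depth_outliers(window_counts, expected, low=0.6, high=1.4):
    """Return (window index, observed/expected ratio) for unusual windows.

    window_counts: reads aligned to each fixed-size window of the reference.
    expected: reads expected per window if both chromosome copies are intact.
    low, high: illustrative cutoffs, chosen here only for the example.
    """
    flagged = []
    for i, count in enumerate(window_counts):
        ratio = count / expected
        if ratio < low or ratio > high:
            flagged.append((i, round(ratio, 2)))
    return flagged


# Made-up counts: windows 2 and 3 have about half the expected reads,
# which is what a deletion on one chromosome copy could look like.
counts = [98, 103, 51, 49, 100, 152]
print(flag_depth_outliers(counts, expected=100))
# prints [(2, 0.51), (3, 0.49), (5, 1.52)]
```

A flagged window is only a candidate. Uneven sequencing or misaligned reads can produce the same pattern in a single genome.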
The problem with these signatures of variation is that they can lead to a high false discovery rate. Steve explains that the signatures can arise from true variants, but they can also arise from sources that are not true variants, and the algorithms are routinely tricked.
As the costs of sequencing plummet, it has become possible to sequence the genomes of large numbers of people, as in the 1000 Genomes Project. This deluge of data offered Steve and his fellow Broad researchers Bob Handsaker, first author on the new study, Joshua Korn, and James Nemesh a unique opportunity to hunt for structural variants in a new way. Steve says, “We started from first principles and asked if you had data from many different people, are there ways you’d analyze it differently than if you had data from one person.” As Bob explains, “We really tried to leverage the fact that we had data on a population level. It made our approach work well.”
The scientists built a computer algorithm that combined the traditional method of looking for signatures of variation in individual genomes with new, population-aware ideas that leveraged the power of numbers. In one of these new approaches, Genome STRiP searched for deletions found in many people due to shared ancestry among humans. “Most of the variation in any one person’s genome is present in other people’s genomes,” says Steve. Another component of the algorithm helped it avoid being “tricked” by false discovery, a problem potentially worsened by large-scale sequencing data. If a true variant is found, it should make some genomes different from others, so that people’s genomes can be distinguished by the presence or absence of that variant in their DNA.
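The idea that a true variant should set some genomes apart from others can be sketched as a simple dispersion check. This is an assumed, simplified stand-in, not the statistical test Genome STRiP itself uses; the function name and the example numbers are hypothetical. It compares the reads supporting a candidate deletion in each genome with what would be expected if that evidence were spread across genomes in proportion to sequencing coverage.

```python
def dispersion_index(support_counts, coverages):
    """Chi-square statistic divided by its degrees of freedom.

    support_counts: reads supporting the candidate variant, one per genome.
    coverages: positive sequencing coverage for each genome.
    Values near or below 1 mean the evidence is spread evenly (suspicious);
    values well above 1 mean it is concentrated in a subset of genomes.
    """
    total_support = sum(support_counts)
    if total_support == 0 or len(support_counts) < 2:
        return 0.0
    total_coverage = sum(coverages)
    chi2 = 0.0
    for observed, coverage in zip(support_counts, coverages):
        expected = total_support * coverage / total_coverage
        chi2 += (observed - expected) ** 2 / expected
    return chi2 / (len(support_counts) - 1)


# Evidence spread evenly over four genomes: looks like a shared artifact.
print(dispersion_index([2, 2, 2, 2], [1, 1, 1, 1]))  # prints 0.0
# Same total evidence, all in one genome: consistent with a real variant.
print(dispersion_index([8, 0, 0, 0], [1, 1, 1, 1]))  # prints 8.0
```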
Another idea employed in Genome STRiP is a basic one. The human genome contains two copies of each chromosome, one from the mother and one from the father. Because of structural variation, one copy of a chromosome could be intact, and the other could have a missing section, a deletion. If a suspected variant (a deletion, for example) is true and exists in the population, then a person’s genome could have two copies of the version harboring the deletion, two copies of the normal version, or one of each. “It’s a simple idea, like all of these ideas,” says Steve, “but they were novel approaches and turned out to be profoundly enabling.”
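The three possible states can be turned into a toy genotype caller. The sketch below rests on stated assumptions: it uses only read depth, models read counts with a Poisson distribution, and allows a small hypothetical rate of stray reads in a fully deleted region. It is not the model described in the paper. It picks whichever number of intact copies (0, 1, or 2) best explains the reads observed inside the candidate deletion.

```python
import math


def poisson_log_pmf(k, lam):
    """Log probability of observing k events when lam are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def call_intact_copies(reads_in_region, expected_two_copies, stray_rate=0.02):
    """Return the most likely number of intact copies: 0, 1, or 2.

    reads_in_region: reads observed inside the candidate deletion.
    expected_two_copies: reads expected if both copies are intact.
    stray_rate: assumed fraction of reads that still appear when both
        copies are deleted (for example, from misalignment).
    """
    best_copies, best_loglik = None, -math.inf
    for copies in (0, 1, 2):
        expected = expected_two_copies * max(copies / 2, stray_rate)
        loglik = poisson_log_pmf(reads_in_region, expected)
        if loglik > best_loglik:
            best_copies, best_loglik = copies, loglik
    return best_copies


# With 100 reads expected for two intact copies:
for reads in (100, 48, 1):
    print(reads, "reads ->", call_intact_copies(reads, 100), "intact copies")
```

Requiring each person's data to fit one of these three states is part of what makes the check useful: read counts that fit none of them cleanly are a hint that the candidate is not a real deletion.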
Genome STRiP was the most successful algorithm in the study at finding deletions, and at the same time had the lowest rate of false discovery. With next-generation sequencing data, the scientists were able to map structural variants down to the single-letter level, producing a higher resolution map than was achievable before. Steve explains that the team expected to see more variants with better resolution, but they were also able to create a detailed physical map of the variants’ locations, something he says is a great resource for genetic studies. “It’s like if you’ve never been to Norway, you might expect to see fjords when you go there. But if you actually make a map of them, other travelers can go back and find specific ones using the map,” he says, adding that scientists can now use the variation map in studies for disease associations.
The algorithm was able to not only discover variants, but also genotype them in the population — determining, for each variant, whether a person has two intact copies of the chromosome, two copies harboring the deletion, or one of each. Steve explains that the ability to genotype structural variants and associate them with “marker” SNPs is empowering for future studies. Scientists can re-analyze data from studies that focused on SNPs to look for structural variants that might be risk factors for disease.
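One common way to ask whether a SNP can serve as a marker for a structural variant is to compute the squared correlation between the two sets of genotypes across people. The sketch below assumes each genotype is coded as 0, 1, or 2 copies of an allele and uses made-up data; it illustrates the general idea and is not the analysis pipeline from the paper.

```python
def genotype_r_squared(variant_genotypes, snp_genotypes):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(variant_genotypes)
    mean_v = sum(variant_genotypes) / n
    mean_s = sum(snp_genotypes) / n
    cov = sum((v - mean_v) * (s - mean_s)
              for v, s in zip(variant_genotypes, snp_genotypes))
    var_v = sum((v - mean_v) ** 2 for v in variant_genotypes)
    var_s = sum((s - mean_s) ** 2 for s in snp_genotypes)
    if var_v == 0 or var_s == 0:
        return 0.0  # a genotype that never varies cannot tag anything
    return cov * cov / (var_v * var_s)


# Hypothetical genotypes for six people.
deletion = [0, 1, 2, 0, 1, 0]   # copies of the deletion allele
snp_a = [0, 1, 2, 0, 1, 0]      # tracks the deletion perfectly
snp_b = [1, 1, 1, 0, 2, 1]      # largely unrelated

print(genotype_r_squared(deletion, snp_a))
print(genotype_r_squared(deletion, snp_b))
```

A value near 1 means the SNP is a good stand-in for the deletion, so earlier studies that measured only that SNP carry information about the deletion as well.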
When the project began two years ago, none of the researchers had worked with next-generation sequencing data before. As team leader, Steve was impressed by first author Bob Handsaker’s ability to master a new field and create a truly innovative research tool. He says, “He’s gone from being the new kid to being one of the leaders in the field in terms of methods development in such a short time.” Bob explains that he enjoyed the collaborative nature of the 1000 Genomes Project. “It was a great opportunity to work cooperatively with structural variation researchers from all over the world,” he says. “I hope Genome STRiP will prove useful in many other studies.”
Paper cited: Handsaker RE, et al. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics. 2011. DOI: 10.1038/ng.768.