Large-scale sequencing study reveals common variation in DNA structure

Differences in DNA among people can influence susceptibility to disease. Such differences exist at both the fine scale of single-letter variants known as single nucleotide polymorphisms (SNPs) and the large scale of structural variation. Until recently, structural variation was studied with array-based technologies that offered only a low-resolution view. But a new analysis of genome sequence data from the 1000 Genomes Project, led by researchers from the Broad Institute of MIT and Harvard and elsewhere, reveals thousands of structural variants in the human genome and offers researchers a rich, nucleotide-resolution map of structural variation that can aid in studies of human biology and genetic risk factors for disease. The findings appear in the February 3 online edition of Nature.

A decade ago, scientists recognized that some DNA changes were common in the human population and could be studied to reveal disease factors. With a focus on SNPs, they conducted large-scale studies to map these variants and built research tools like microarrays to scan the genome for SNPs with association to traits and disease.

But in 2004, a study led by Charles Lee, a clinical cytogeneticist at Brigham and Women’s Hospital, associate professor of pathology at Harvard Medical School, and associate member of the Broad, demonstrated that another type of variation was common in the human genome: structural variation. While SNPs are substitutions of one DNA base, or “letter,” for another, structural variants can be extra or missing DNA (known as duplications or deletions), insertions of DNA segments from one area of the genome into another, or inversions, in which a segment of DNA is flipped.

Since that initial glimpse of the prevalence of structural variants in the human genome, the largest component being DNA copy number variants, scientists have learned that this type of variation is more common than they initially thought. Although apparently healthy people are estimated to harbor thousands of these DNA differences, more than a dozen common structural variants have been linked to diseases, including autism, schizophrenia, and Crohn’s disease.

In 2008, an international, interdisciplinary consortium of scientists launched the 1000 Genomes Project, with the goal of generating the most comprehensive map of human genetic variation yet using next-generation sequencing technologies. Despite the name, the consortium now aims to sequence the genomes of 2,500 people by the end of 2012. Lee explained that at the project’s inception, consortium members decided to include structural variation in the study, as well as single-letter variants, because of the growing recognition of this type of DNA variation. A team co-chaired by Lee was formed to assess structural variation in the project data. It was led by scientists at Brigham and Women’s Hospital, Harvard Medical School, the Broad Institute, the Wellcome Trust Sanger Institute, the University of Washington, and the European Molecular Biology Laboratory in Germany.

In the 1000 Genomes Project’s pilot phase, completed in 2010, whole genome sequences were generated for 185 people. Over the past two years, the structural variation team held weekly meetings to strategize on how to identify variants among the massive amount of sequencing data from the pilot phase. “It’s been a very arduous task to say the least,” said Lee, who is co-senior author on the new paper. He explained that while SNPs are relatively simple variants, structural variation is more complex, ranging in size and form and requiring creative algorithms to discover. “We’ve been living in a SNP-centric world,” said Lee. “Structural variation is a whole new game and it’s so much more complicated. We quickly realized…that we had an enormous task ahead and had to be very innovative.”

The scientists developed 19 computer programs, or algorithms, that each approached the problem in a different way. In total, the algorithms identified 22,025 deletions in the human genome and 6,000 other structural variants. With an unprecedented number of whole genome sequences at their disposal, the team was able to discover variants that were less common and smaller in size than those known before. The algorithms were more successful at picking up some DNA changes than others. Another lesson the team learned is that none of the algorithms were perfect; some algorithms could detect variants that the others could not.
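One way to picture why several imperfect algorithms can be combined into a more complete catalog is the sketch below. It is a minimal illustration under stated assumptions, not the consortium's actual merging procedure: the caller names, the toy coordinates, and the 50 percent reciprocal-overlap threshold are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple

# A deletion call as (chromosome, start, end), half-open coordinates.
Deletion = Tuple[str, int, int]


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Return the smaller of the two overlap fractions (0.0 if no overlap)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def merge_call_sets(
    call_sets: Dict[str, List[Deletion]], min_overlap: float = 0.5
) -> List[Tuple[Deletion, Set[str]]]:
    """Take the union of all calls, recording which callers support each one.

    Two calls are treated as the same variant when their reciprocal overlap
    is at least `min_overlap` (an assumed threshold for this sketch).
    """
    merged: List[Tuple[Deletion, Set[str]]] = []
    for caller, calls in call_sets.items():
        for call in calls:
            for representative, supporters in merged:
                if reciprocal_overlap(representative, call) >= min_overlap:
                    supporters.add(caller)
                    break
            else:
                # No existing entry matched, so this is a new variant.
                merged.append((call, {caller}))
    return merged


# Hypothetical toy input: two callers that agree on one deletion
# and each find one deletion the other misses.
toy_calls = {
    "caller_a": [("chr1", 1000, 2000), ("chr2", 500, 900)],
    "caller_b": [("chr1", 1050, 2050), ("chr3", 100, 400)],
}
merged = merge_call_sets(toy_calls)
for deletion, supporters in merged:
    print(deletion, sorted(supporters))
```

In this toy input the merged list holds three deletions: one supported by both callers and two found by only one caller each, which mirrors the observation that some algorithms detect variants the others miss.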

One factor that aids in finding structural variants is the depth of sequencing, or “coverage,” which represents the average number of times a DNA sequence is read by the sequencing technology. In the study, six genomes from two families were sequenced at high coverage (an average of 42 reads per DNA segment), and 179 individuals were sequenced at low coverage (an average of two to six reads per segment), or “low-pass.” Discovering structural variants is easier with higher-coverage sequencing data, so the team compensated for the limitations of the low-coverage data in two ways. One was to combine the low-coverage next-generation sequencing data with genotyping data from microarrays. Another was to leverage population-based genetic strategies made possible by the large number of sequenced whole genomes available.
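The notion of coverage described above can be made concrete with a short calculation. The sketch below uses the standard approximation that mean coverage equals the total number of sequenced bases divided by the length of the region; the function name and the toy read counts are hypothetical and are not taken from the study.

```python
def mean_coverage(num_reads: int, read_length: int, region_length: int) -> float:
    """Average number of reads covering each base of a region.

    Approximated as (number of reads * read length) / region length.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return num_reads * read_length / region_length


# Hypothetical toy numbers: a 1,000-base region and 100-base reads.
toy_region_length = 1_000
toy_read_length = 100

high = mean_coverage(420, toy_read_length, toy_region_length)
low = mean_coverage(40, toy_read_length, toy_region_length)

print(f"high coverage: {high:.1f}x")  # 420 * 100 / 1000 = 42.0
print(f"low coverage:  {low:.1f}x")   # 40 * 100 / 1000 = 4.0
```

With these toy values, the first case matches the roughly 42-fold depth of the high-coverage genomes, and the second falls within the two-to-six-fold range of the low-pass genomes.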

Broad researcher Bob Handsaker, co-first author on a companion study, explained that the team faced a challenge in scaling up their approaches and dealing with so much data. But by leveraging the power of population-scale data, they could discover variants in low-pass data. As co-senior author and Broad associate member Steven McCarroll explained, “[Our approach] made it possible for one algorithm to both find many more structural variants than other approaches, but at the same time find them in a very clean way that wasn’t distracted by things that are not actually structural variants.” One algorithm developed in this way by Handsaker, McCarroll, and other Broad scientists, known as Genome STRiP, was especially successful in using this strategy to locate deletions.

By applying Genome STRiP and 18 other ingenious algorithms to high-quality sequencing data, the structural variation group of the 1000 Genomes Project was able to pinpoint over half of the variants to single-letter resolution, enabling the creation of a richer map of genetic variation than was possible before. The increased resolution also offered clues as to how these variants are formed, which may help scientists understand when the variants arose in the human population and why. The work also revealed clusters of variation, known as hotspots, in the human genome. “These insights have broad-based implications for biomedical research, including cancer research,” said Lee.

More variants remain to be found, but the new map of variation is already making waves. The Genome Reference Consortium at the NIH, which maintains the reference human genome, is now revising the sequence to include structural variants from the 1000 Genomes Project. Microarray makers are also using variants from this study to design new and improved genotyping arrays, which will enable studies of genetic risk factors for disease.

“This study represents the collective efforts of 57 experts in structural variation from 26 different institutions and hence represents the most comprehensive analysis of structural variation in any whole genome sequencing dataset ever published,” said Lee. “This resource should be useful for anyone comprehensively analyzing next generation DNA sequencing data for disease-causing genetic variants.”

Paper(s) cited

Mills et al., Mapping copy number variation by population-scale genome sequencing. Nature 470, 59-65. 03 February 2011. DOI: 10.1038/nature09708