1000 Genomes Project releases pilot studies data set

By Alice McCarthy, Broad Communications, June 25th, 2010

Results from the first three pilot studies of the 1000 Genomes Project, an international gene sequencing effort designed to capture variations in the human genome, were publicly released on Monday, June 21. The project is the first of its kind. When complete, this self-titled "Deep Catalog of Human Genetic Variation" aims to fully document the genetic changes found with 1 percent frequency or more in human populations.

The project strategy is to sequence the genomes of people from differing ethnic/geographic regions to yield a clearer picture of human genetic variation. To do this, researchers home in on identifying single nucleotide polymorphisms (SNPs), the simplest form of DNA change, as well as structural variants within the human genome.

The DNA sequence of any genome consists of a series of four fundamental building blocks - nucleotides. A SNP is a single nucleotide variation in the DNA sequence, a point mutation. SNPs vary between human geographic and ethnic groups - and between people with and without a certain disease - so they are useful in tracking genetic diversity and disease relationships.

In this first phase of the effort, researchers sequenced approximately 200 genomes to varying depths. Ultimately, however, researchers expect to catalogue the genomes of 2,500 individuals from 27 populations globally to fully meet the project's goals.

"The project has released about 9 million new SNPs we did not know about before," says Stacey Gabriel, Ph.D., of the Broad Institute of MIT and Harvard, a member of the 1000 Genomes Project consortium. Of the 18 million SNPs publicly available, nearly half have been contributed from the pilot studies of the 1000 Genomes Project. As with all results from the project, the pilot data are publicly available at the 1000 Genomes Project website.

Complete genome sequencing requires that an individual's DNA be sequenced about 30 times (30X). Less coverage provides less precision on an individual sample basis. However, current costs prohibit deep (30X) sequencing of all the samples in this study. Instead, researchers tested alternatives to find reliable, robust yields.
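The trade-off between coverage and per-sample precision can be illustrated with a small calculation. The sketch below is not the project's method: it assumes an idealized model in which reads land uniformly at random, so the number of reads covering a given site follows a Poisson distribution whose mean is the coverage. The function name `prob_at_least` and the threshold of 8 reads are hypothetical choices made for illustration.

```python
import math


def prob_at_least(mean_depth: float, min_reads: int) -> float:
    """Probability that a site is covered by at least `min_reads` reads,
    assuming per-site depth is Poisson-distributed with mean `mean_depth`
    (an idealization; real sequencing depth is less uniform)."""
    p_fewer = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Compare the low-coverage (4X) and deep (30X) designs at an
# illustrative threshold of 8 reads per site (an assumed value).
for depth in (4, 30):
    print(f"{depth}X: P(depth >= 8) = {prob_at_least(depth, 8):.4f}")
```

Under this model, most sites in a 4X genome fall below such a threshold while nearly all sites in a 30X genome exceed it, which is one way to see why shallow data are less precise for any single individual.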

The data made available this week are part of a multipronged sequencing strategy. In these three studies researchers performed:

  • Whole genome sequencing of 180 control subjects with low (4X) coverage.
  • Whole genome sequencing of 6 people - two mother/father/adult child trios - with deep coverage (20-60X).
  • Targeted sequencing of the protein-coding regions of 1,000 genes of 700 individuals with deep coverage.
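
The scale of the two whole-genome designs above can be compared with simple arithmetic: total sequenced bases are roughly samples × coverage × genome size. The sketch below assumes a genome size of about 3.1 billion bases (a commonly quoted round figure, not a number from this article) and uses the hypothetical helper `total_bases`. The targeted gene study is left out because the article does not state the size of the targeted regions.

```python
GENOME_SIZE = 3.1e9  # assumed approximate human genome size, in bases


def total_bases(samples: int, coverage: float) -> float:
    """Approximate number of bases sequenced for a whole-genome design."""
    return samples * coverage * GENOME_SIZE


low_pass = total_bases(180, 4)        # 180 subjects at 4X
trios_min = total_bases(6, 20)        # 6 people at the low end (20X)
trios_max = total_bases(6, 60)        # 6 people at the high end (60X)
all_deep = total_bases(180, 30)       # hypothetical: all 180 subjects at 30X

print(f"low-pass design:   {low_pass:.3e} bases")
print(f"trio design:       {trios_min:.3e} to {trios_max:.3e} bases")
print(f"180 at 30X (hyp.): {all_deep:.3e} bases")
print(f"deep / low-pass ratio: {all_deep / low_pass:.1f}")
```

The ratio depends only on the coverage values (30X versus 4X), so it holds whatever genome size is assumed.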

"The three approaches - deep and shallow whole genome sequencing plus gene sequencing - were used because at the outset of the project it was not clear how much sequencing coverage would be necessary per individual to generate high quality data," explains Gabriel. "This type of combined analysis is informative about how to design studies going forward as we sequence individuals in the study of diseases," she adds.

Producing deep whole genomes for a few individuals, particularly families, was informative for creating gold standard SNP sets that the researchers used in developing analytic methods for detecting variation. Says Gabriel, "Comparing the low pass sequencing on the bulk of individuals with the deep sequencing allowed us to set strategies for completing the main part of the project."

Deep sequencing (50X) of the protein-coding regions of 1,000 genes was done to shed light on the 2 percent of the human genome composed of genes. "The bulk of what we can functionally interpret in human genetics and cancer genetics is found in the protein-coding portion of the genome, so this part of the project was useful to us in pioneering targeting methods to sequence just the genes," says Gabriel, director of the Genome Sequencing and Analysis Program, and co-director of the Genome Sequencing and Genetic Analysis Platform and the Program in Medical and Population Genetics at the Broad.

The 1000 Genomes Project was founded in 2008 by the National Human Genome Research Institute (NHGRI), the Wellcome Trust Sanger Institute, and the Beijing Genomics Institute. Under sponsorship of the NHGRI, the Broad Institute of MIT and Harvard has been a major driver of the project. David Altshuler, deputy director of the Broad, co-chairs the overall effort and Stacey Gabriel leads the project's data production group.

The 1000 Genomes Project has major sample collection, data collection, and analysis components. "This is the first large-scale, next-generation sequencing project that has been done in human genetics so we first needed to learn how to work with the incredibly large data sets," says Gabriel.

Led by Mark DePristo, the Broad's Medical and Population Genetics analysis group designed a fundamental programming framework for extracting and manipulating the data in a straightforward, efficient way. "One of the major contributions to this project has been the development of general-purpose tools for working with next-generation sequencing data," explains DePristo. Current sequencing technology has provided an exponential boom in the amount of information provided, compared with earlier technologies. But this has complicated the analysis of the data.

To mine the data, DePristo and his Broad colleagues pioneered a programming architecture for working with large-scale data sets that separates, invisibly to the user, the engine for working with the data from the individual analysis tools. "We have been able to apply very sophisticated computational approaches to access and manipulate the 7.3 terabytes of data generated so far in this project," explains DePristo.
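
The idea of separating a data engine from the analysis tools that run on it can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Broad's actual framework or API: the names `shards`, `run_engine`, and `is_variable_site`, the `Locus` record, and the sample data are all hypothetical. The engine handles batching and traversal, while a tool supplies only a per-site function and a way to combine results.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

# Hypothetical record: (chromosome, position, bases observed in reads)
Locus = Tuple[str, int, str]
M = TypeVar("M")
R = TypeVar("R")


def shards(loci: Iterable[Locus], size: int) -> Iterator[List[Locus]]:
    """Engine detail: split the input into fixed-size batches."""
    batch: List[Locus] = []
    for locus in loci:
        batch.append(locus)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_engine(
    loci: Iterable[Locus],
    map_fn: Callable[[Locus], M],
    reduce_fn: Callable[[R, M], R],
    initial: R,
    shard_size: int = 2,
) -> R:
    """Engine: traverse the data shard by shard and fold tool results."""
    result = initial
    for shard in shards(loci, shard_size):
        for locus in shard:
            result = reduce_fn(result, map_fn(locus))
    return result


def is_variable_site(locus: Locus) -> int:
    """Tool: return 1 if the reads at this site disagree, else 0."""
    _, _, bases = locus
    return 1 if len(set(bases)) > 1 else 0


# Made-up example data, for illustration only.
pileup = [
    ("chr1", 100, "AAAA"),
    ("chr1", 101, "AAGA"),
    ("chr1", 102, "CCCC"),
    ("chr2", 5, "TTTC"),
    ("chr2", 6, "GG"),
]

count = run_engine(pileup, is_variable_site, lambda acc, x: acc + x, 0)
print(count)  # number of sites where the reads disagree
```

In a design like this, changing how data are batched or read touches only the engine, and a new analysis is just another pair of functions passed to `run_engine`.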

The medical and research communities at the Broad and beyond are already benefiting from the 1000 Genomes Project because many of the details related to performing next-generation sequencing analyses were resolved in the pilot projects. "Now, we can do the data processing of our medical genetics projects with much less effort because so much of that was done in the 1000 Genomes Project," says DePristo.

In the coming months, the consortium will publish a primary paper on the pilot studies along with several companion papers.