A cast of 1000 genomes
Nearly every week, new genomes are welcomed into the vast annals of modern science. Indeed, genomic research is moving at an ever-increasing pace, as the machines that decode — or “sequence” — DNA churn out data faster and more cheaply than ever before.
In that light, today’s announcement by the 1000 Genomes Project, an international public-private consortium, might seem like “yet-another-genome” or, more aptly, a few hundred of them. In the October 28th issue of Nature, consortium scientists unveil their latest results.
But the project’s goals run far deeper than merely decoding lots of genomes. In addition to kicking the tires of so-called “next-generation” technologies for sequencing DNA, the project seeks to create a comprehensive map of human genetic variation.
The phrase “human genetic variation” refers to the notion that the genomes of different people, while remarkably similar, are, in fact, not identical. They contain differences in the order, or sequence, of “letters” — As, Gs, Cs, and Ts — that make up the genetic code. While those differences reflect a pretty small slice of the total genomic pie, they hold important clues about the connections between genes and common diseases like cancer, diabetes, schizophrenia, and many others.
Unearthing these connections depends on a deep knowledge of all the myriad ways the human genome varies. For instance, genetic variants come in different types (from small to very large) and frequencies (from the most common to the very rare). In an ideal world, the best way for scientists to gather this knowledge is to decode and compare the full genome sequences of many people from diverse geographic regions. A task of such magnitude has only recently become feasible.
If some of this sounds vaguely familiar, that’s because the roots of the 1000 Genomes Project were laid down over the last few years, thanks to two critical tools: a first-generation map of human genetic variation called the HapMap, which was first made available in 2005, and an approach known as GWAS (short for Genome-Wide Association Study). Using these tools, researchers here at the Broad and elsewhere throughout the world have identified hundreds of sites in the human genome that are associated with diseases like type 2 diabetes, Crohn’s disease, rheumatoid arthritis, heart disease and many others.
Yet even with these successes, the scientists recognized their tools were incomplete. For example, the HapMap detected only the most common variations and primarily one type of variant — single letter changes in the genetic code known as SNPs (which stands for single nucleotide polymorphisms). It was clear there was much more genomic wilderness yet to be combed for connections to human disease.
Enter the 1000 Genomes Project. Launched in 2008, the consortium, spanning multiple scientific disciplines, organizations, and countries, set out to create the most detailed picture yet of human genetic variation. That picture includes the rare stuff (variants present at a frequency as low as 1%) and all forms of genetic variation in humans: SNPs, small insertions and deletions (called “indels”), and large changes in chromosome structure and in the number of copies of stretches of DNA (called “copy number variations” or CNVs).
Today the consortium publishes the first fruits of its efforts, including the results of its initial work to determine the most efficient and effective ways to make use of next-gen technologies. (Despite the rapid progress, it’s still quite expensive to fully decode thousands of people’s genomes.) Flowing from these pilot studies, the researchers have already compiled the most detailed map yet of human genetic variation — estimated to contain more than 95% of the genetic variation of any person on Earth.
Some interesting findings are already coming into view. For example, the scientists report that, on average, each person carries around 250 to 300 variants that disrupt the function of a gene, and about 50 to 100 variants that have been previously linked to disease. But since each person carries two copies of every gene, the vast majority of these changes are likely to go unnoticed, with little if any impact on overall health.
It’s worth pausing to note just how much data researchers collected in this first phase of the project: roughly 4.5 terabases of DNA sequence. That’s 4.5 million million genetic letters or “bases” — over 1,000 times the number in one person’s genome.
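For readers who want to check that comparison, here is a rough back-of-the-envelope sketch. It assumes a single copy of the human genome is about 3 billion bases long, a commonly quoted round figure that is not stated in this article; the function name is likewise just illustrative.

```python
# Rough check of the "over 1,000 times" comparison.
# ASSUMPTION: one copy of the human genome is ~3 billion bases (a round figure,
# not taken from the article).
ASSUMED_GENOME_BASES = 3_000_000_000

# 4.5 terabases = 4.5 million million bases (figure quoted in the article).
PILOT_SEQUENCED_BASES = 4_500_000_000_000


def genome_equivalents(total_bases: int, genome_bases: int = ASSUMED_GENOME_BASES) -> float:
    """Return how many genome-lengths of sequence total_bases represents."""
    if genome_bases <= 0:
        raise ValueError("genome_bases must be positive")
    return total_bases / genome_bases


if __name__ == "__main__":
    fold = genome_equivalents(PILOT_SEQUENCED_BASES)
    print(f"Roughly {fold:,.0f} genome-lengths of sequence")
```

Under that assumption the pilot data works out to about 1,500 genome-lengths of sequence, consistent with "over 1,000 times." A different estimate of genome size would shift the exact multiple.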
On top of this impressive stack of data is the cast of thousands who are helping to generate and analyze it. At the Broad Institute alone, scores of researchers are participating, including project co-chair David Altshuler; Stacey Gabriel, co-chair of the data production group; researchers Mark Daly, Mark DePristo, Charles Lee, and Steven McCarroll; members of the Genome Sequencing Platform and the Genome Sequencing and Analysis Program; and many others.
Of course, more work is yet to come. The consortium intends to decode the genomes of some 2,500 people from 27 countries, on the path toward a map that encompasses nearly 99% of all human genetic variation.