Researchers expand and upgrade the 1000 Genomes Project resource

Using high-coverage whole-genome sequencing and improved analytic approaches, scientists have re-sequenced samples from the project to improve the publicly accessible resource.

Credit: Ricardo Job-Reese, Broad Communications

Seven years ago, the 1000 Genomes Project (1kGP) published an open-access resource based primarily on low-coverage whole-genome sequencing (WGS) data of 2,504 individuals from 26 populations representing five continental regions of the world, making it the first large-scale WGS effort to deliver a catalog of human genetic variation.

Now, researchers at the New York Genome Center, in collaboration with groups at the Massachusetts General Hospital, Yale University, and the Human Genome Structural Variation Consortium (HGSVC), have expanded the 1kGP resource to include nearly all parent-child trios in the collection, alongside the original samples, and sequenced them at high coverage using Illumina NovaSeq instruments. The study, published in Cell, presents comprehensive analyses of the high-coverage WGS data on the expanded 1kGP cohort, which now consists of 3,202 samples, including 602 trios.

“The 1000 Genomes Project cohort is such a valuable resource, we felt it would be useful to the community to bring the sequencing up to date with the latest version of short-read technology while adding in the richness of the previously omitted family samples,” explained Michael Zody, scientific director of computational biology at the New York Genome Center, and the study’s senior author.

Using state-of-the-art methods and algorithms, researchers at the New York Genome Center sequenced DNA derived from lymphoblastoid cell lines (immortalized human B cells from peripheral blood) from the expanded cohort to a targeted depth of 30X genome coverage. Next, the group performed single nucleotide variant (SNV) and short insertion and deletion (INDEL) calling, a two-step process: first identifying variant sites in the sequence data relative to the human reference genome, then genotyping each discovered site across all samples in the cohort.
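The two steps described above, site discovery followed by genotyping across the cohort, can be illustrated with a deliberately simplified Python sketch. It is not the study's pipeline: the reference string, sample names, and sequences are invented for illustration, and real variant callers work from aligned reads and statistical genotype likelihoods rather than from finished haplotype sequences.

```python
# Toy illustration only. All names and sequences below are hypothetical.
REFERENCE = "ACGTACGT"

# Each diploid sample is represented by two haplotype strings.
SAMPLES = {
    "sample_A": ("ACGTACGT", "ACCTACGT"),
    "sample_B": ("ACCTACGT", "ACCTACGA"),
    "sample_C": ("ACGTACGT", "ACGTACGT"),
}


def discover_sites(reference, samples):
    """Step 1: collect positions where any haplotype differs from the reference."""
    sites = set()
    for haplotypes in samples.values():
        for hap in haplotypes:
            for pos, (ref_base, base) in enumerate(zip(reference, hap)):
                if base != ref_base:
                    sites.add(pos)
    return sorted(sites)


def genotype(reference, samples, sites):
    """Step 2: count non-reference alleles (0, 1 or 2) per sample at every site."""
    return {
        name: {
            pos: sum(hap[pos] != reference[pos] for hap in haplotypes)
            for pos in sites
        }
        for name, haplotypes in samples.items()
    }


sites = discover_sites(REFERENCE, SAMPLES)
calls = genotype(REFERENCE, SAMPLES, sites)
```

Note that every sample receives a genotype at every discovered site, including samples such as the hypothetical sample_C that carry only reference alleles. That is what genotyping "across all samples in the cohort" means in this simplified picture.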

Additionally, a team from Michael Talkowski’s group at Harvard Medical School, the Broad Institute of MIT and Harvard, and Massachusetts General Hospital, in collaboration with Ira Hall’s group at Yale University and the Washington University School of Medicine, as well as the HGSVC, discovered and genotyped a comprehensive set of structural variants (SVs) across the 3,202 1kGP samples by integrating multiple analytic approaches.

Overall, the study shows significant improvements in both discovery power and precision of variant calls, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum, which were previously inaccessible with low-coverage sequencing.

An important use of the original 1kGP resource is as a reference panel for variant imputation: the statistical inference of unobserved genotypes in sparse, array-based samples, based on groupings of variants that are typically inherited together in the population, as learned from the reference panel. Imputation against the 1kGP panel has facilitated numerous genome-wide association studies (GWAS). Now, with the expansion of the original resource, the team has upgraded the reference imputation panel to include more variants discovered through high-coverage WGS and trio families.
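For readers unfamiliar with imputation, the idea can be shown with a minimal Python sketch. It assumes a tiny, made-up reference panel of haplotypes (0 for the reference allele, 1 for the alternate allele) and a single target haplotype typed at only some sites. The sketch simply copies missing alleles from the closest-matching panel haplotype; production imputation tools use probabilistic models instead, so this is a stand-in for the concept, not the method used in the study.

```python
# Toy illustration only. The panel and target below are hypothetical.
# Each row of PANEL is one reference haplotype across five variant sites.
PANEL = [
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
]

# A sparsely typed target haplotype: None marks sites the array did not measure.
TARGET = [1, None, 0, None, 0]


def impute(target, panel):
    """Fill unobserved sites from the panel haplotype that best matches the observed ones."""
    observed = [i for i, allele in enumerate(target) if allele is not None]

    def mismatches(haplotype):
        # Number of observed sites where this panel haplotype disagrees with the target.
        return sum(haplotype[i] != target[i] for i in observed)

    best = min(panel, key=mismatches)
    return [best[i] if allele is None else allele for i, allele in enumerate(target)]


imputed = impute(TARGET, PANEL)
```

A larger panel with more variant sites gives the matching step more to work with, which is the intuition behind upgrading the panel with variants found through high-coverage sequencing.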

“The new imputation panel includes more sites, especially many more common INDELs and SVs, thus expanding the number of variants accessible for GWAS, which, given the large effect size of non-SNV variation, is likely to enable discovery of new genetic associations that help pinpoint the causative variant,” explained Marta Byrska-Bishop, a senior bioinformatics scientist at the New York Genome Center, and the study’s co-first author.

All raw sequence data and variant call sets were released to the public immediately upon sequencing completion via several genomic data repositories, including the International Genome Sample Resource, which is maintained by co-authors from the European Bioinformatics Institute at the European Molecular Biology Laboratory.

“Our goal is to have this public resource serve as the benchmark for future population genetic studies and methods development,” added Xuefang Zhao, a postdoctoral fellow at the Center for Genomic Medicine at Massachusetts General Hospital and the Broad Institute, and the study’s co-first author.

The data have already gathered interest from the genetics and genomics community. That interest will likely continue for years to come thanks to the fully open-access nature of the 1kGP samples, which, unlike those in most newly emerging WGS efforts, are consented for public distribution of genetic data without access or use restrictions.

Sequencing was supported by grants from the National Human Genome Research Institute (NHGRI). This analysis was partly supported by grants from NHGRI, the National Institute of Child Health and Human Development (NICHD), the National Institute of Mental Health (NIMH), the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust.

Adapted from a press release from the New York Genome Center.  

Paper(s) cited

Byrska-Bishop M et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. Online September 1, 2022. DOI: 10.1016/j.cell.2022.08.004