HapMap 3

This is release 2 of the genome-wide SNP genotyping and targeted sequencing data from DNA samples from a variety of human populations (sometimes referred to as the "HapMap 3" samples).

This release contains the following data:

  • SNP genotype data generated from 1184 samples, collected using two platforms: the Illumina Human1M (by the Wellcome Trust Sanger Institute) and the Affymetrix SNP 6.0 (by the Broad Institute). Data from the two platforms have been merged for this release.
  • PCR-based resequencing data (by Baylor College of Medicine Human Genome Sequencing Center) across ten 100-kb regions (collectively referred to as "ENCODE 3") in 712 samples.

Please note that there is a release 3 of HapMap 3 data that can be downloaded from the central HapMap website.

DATA PRODUCTION INSTITUTIONS

  • Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC)
  • Broad Institute of Harvard and MIT (BI)
  • Wellcome Trust Sanger Institute (WTSI)

FUNDING AGENCIES

      HAPMAP 3 SAMPLES

      The HapMap 3 sample collection comprises 1,301 samples (including the original 270 samples used in Phase I and II of the International HapMap Project) from 11 populations, listed below alphabetically by their 3-letter labels.

      label  population sample                                                                    number of samples
      ASW    African ancestry in Southwest USA                                                    90
      CEU    Utah residents with Northern and Western European ancestry from the CEPH collection  180
      CHB    Han Chinese in Beijing, China                                                        90
      CHD    Chinese in Metropolitan Denver, Colorado                                             100
      GIH    Gujarati Indians in Houston, Texas                                                   100
      JPT    Japanese in Tokyo, Japan                                                             91
      LWK    Luhya in Webuye, Kenya                                                               100
      MXL    Mexican ancestry in Los Angeles, California                                          90
      MKK    Maasai in Kinyawa, Kenya                                                             180
      TSI    Toscans in Italy                                                                     100
      YRI    Yoruba in Ibadan, Nigeria                                                            180

      ENCODE 3 REGIONS

      Five of the ten ENCODE 3 regions overlap with the HapMap-ENCODE regions; the other five were selected at random from the ENCODE target regions (excluding the 10 HapMap-ENCODE regions). All ENCODE 3 regions are 100 kb in size and are centered within their respective ENCODE regions.

      region  chromosome  coordinates (NCBI build 36)  status
      ENm010  7           27,124,046-27,224,045        HapMap-ENCODE
      ENr321  8           119,082,221-119,182,220      HapMap-ENCODE
      ENr232  9           130,925,123-131,025,122      HapMap-ENCODE
      ENr123  12          38,826,477-38,926,476        HapMap-ENCODE
      ENr213  18          23,919,232-24,019,231        HapMap-ENCODE
      ENr331  2           220,185,590-220,285,589      New
      ENr221  5           56,071,007-56,171,006        New
      ENr233  15          41,720,089-41,820,088        New
      ENr313  16          61,033,950-61,133,949        New
      ENr133  21          39,444,467-39,544,466        New
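
The region table can be checked and queried with a short script. The sketch below is illustrative only: it assumes the coordinates are 1-based and inclusive (which is consistent with every region spanning exactly 100,000 bases), and the names `ENCODE3_REGIONS`, `region_length` and `find_region` are hypothetical helpers, not part of any HapMap tool.

```python
# ENCODE 3 regions as (name, chromosome, start, end) on NCBI build 36,
# copied from the table above. Coordinates are assumed to be 1-based
# and inclusive.
ENCODE3_REGIONS = [
    ("ENm010", "7", 27_124_046, 27_224_045),
    ("ENr321", "8", 119_082_221, 119_182_220),
    ("ENr232", "9", 130_925_123, 131_025_122),
    ("ENr123", "12", 38_826_477, 38_926_476),
    ("ENr213", "18", 23_919_232, 24_019_231),
    ("ENr331", "2", 220_185_590, 220_285_589),
    ("ENr221", "5", 56_071_007, 56_171_006),
    ("ENr233", "15", 41_720_089, 41_820_088),
    ("ENr313", "16", 61_033_950, 61_133_949),
    ("ENr133", "21", 39_444_467, 39_544_466),
]


def region_length(start, end):
    """Length in bases of an inclusive interval."""
    return end - start + 1


def find_region(chrom, pos):
    """Return the name of the ENCODE 3 region containing chrom:pos, or None."""
    for name, region_chrom, start, end in ENCODE3_REGIONS:
        if region_chrom == chrom and start <= pos <= end:
            return name
    return None


# Every region should be exactly 100 kb long.
for name, chrom, start, end in ENCODE3_REGIONS:
    assert region_length(start, end) == 100_000, name
```

A lookup such as `find_region("7", 27_150_000)` then tells you whether a build 36 position falls inside one of the resequenced regions.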

      DATA CONTENT OF THIS RELEASE

      A. SNP GENOTYPE DATA

      label      number of samples  number of QC+ SNPs  number of polymorphic QC+ SNPs
      ASW        83                 1656877             1565172
      CEU        165                1648653             1416121
      CHB        84                 1662767             1332120
      CHD        85                 1646894             1309662
      GIH        88                 1652907             1411455
      JPT        86                 1663087             1300764
      LWK        90                 1649904             1533540
      MXL        77                 1585624             1413654
      MKK        171                1635780             1541375
      TSI        88                 1655975             1423618
      YRI        167                1652198             1505108
      consensus  1184               1472130             1440616

      B. PCR RESEQUENCING DATA

      label  number of samples
      ASW    55
      CEU    119
      CHB    90
      CHD    30
      GIH    60
      JPT    91
      LWK    60
      MXL    27
      MKK    0
      TSI    60
      YRI    120
      total  712

      QUALITY CONTROL FOR THIS RELEASE

      A. SNP GENOTYPE DATA

      Genotyping concordance between the two platforms was 0.9949 (computed over 250,000 SNPs typed on both platforms). Data from the two platforms were merged using PLINK (--merge-mode 1), which keeps a genotype call only when the non-missing calls agree (that is, the merged genotype is set to missing if the two platforms give different, non-missing calls).
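
The merge rule described above can be illustrated with a small sketch. This is not PLINK's implementation: genotypes are modelled as two-letter strings, `None` stands for a missing call, and the function names are hypothetical. The handling of a call that is missing on only one platform (keep the other platform's call) is an assumption about consensus merging that the text above does not state explicitly.

```python
MISSING = None  # hypothetical marker for a missing genotype call


def normalise(call):
    """Treat 'AG' and 'GA' as the same unordered genotype."""
    return MISSING if call is MISSING else "".join(sorted(call))


def merge_calls(illumina_call, affymetrix_call):
    """Consensus merge of one genotype typed on both platforms."""
    a, b = normalise(illumina_call), normalise(affymetrix_call)
    if a is MISSING:
        return b  # assumption: fall back to the other platform
    if b is MISSING:
        return a  # assumption: fall back to the other platform
    # Both calls are present: keep them only if they agree.
    return a if a == b else MISSING


def concordance(call_pairs):
    """Fraction of agreeing calls among pairs where both calls are non-missing."""
    compared = [(normalise(a), normalise(b)) for a, b in call_pairs
                if a is not MISSING and b is not MISSING]
    if not compared:
        return float("nan")
    return sum(a == b for a, b in compared) / len(compared)
```

For example, `merge_calls("AA", "AG")` yields a missing call, while `merge_calls("AG", "GA")` keeps the heterozygous genotype.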

      Quality control at the individual level was performed separately by the two sites. Only individuals with genotype data on both platforms were kept in this release. The following criteria were used to keep SNPs in the QC+ data sets:

      • Hardy-Weinberg p>0.000001 (per population)
      • missingness <0.05 (per population)
      • <3 Mendel errors (per population; only applies to YRI, CEU, ASW, MXL, MKK)
      • SNP must have an rsID and map to a unique genomic location
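
As a rough illustration, the four criteria above can be written as a single predicate over per-population summary statistics. The `SnpStats` record and its field names are hypothetical; the statistics themselves (Hardy-Weinberg p-value, missingness, Mendel error count) are assumed to have been computed beforehand, for example with PLINK.

```python
from dataclasses import dataclass

# Populations for which the Mendel-error criterion applies, as listed above.
MENDEL_CHECKED_POPULATIONS = {"YRI", "CEU", "ASW", "MXL", "MKK"}


@dataclass
class SnpStats:
    """Hypothetical per-population summary for one SNP."""
    snp_id: str
    population: str
    hwe_p: float          # Hardy-Weinberg p-value in this population
    missingness: float    # fraction of missing genotype calls
    mendel_errors: int    # number of Mendel errors
    n_map_locations: int  # number of genomic locations the SNP maps to


def passes_qc(s):
    """Apply the QC+ criteria to one SNP in one population."""
    if not s.snp_id.startswith("rs"):
        return False  # SNP must have an rsID
    if s.n_map_locations != 1:
        return False  # SNP must map to a unique genomic location
    if not s.hwe_p > 0.000001:
        return False  # Hardy-Weinberg p > 0.000001
    if not s.missingness < 0.05:
        return False  # missingness < 0.05
    if s.population in MENDEL_CHECKED_POPULATIONS and not s.mendel_errors < 3:
        return False  # fewer than 3 Mendel errors, where applicable
    return True
```

Note that the Mendel-error check is skipped for populations outside the listed five, so a nonzero `mendel_errors` value is simply ignored for, say, CHB.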

      The "consensus" data set contains data for 1184 individuals (589 males & 595 females, 988 founders & 196 non-founders), only keeping SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus|polymorphic" data set has 31514 monomorphic SNPs (across the entire data set) removed.
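
The construction of the consensus set amounts to a set intersection across populations, and the quoted counts can be cross-checked with simple arithmetic. The sketch below uses made-up rsIDs for the toy example; only the final counts are taken from this release, and `consensus_snps` is a hypothetical helper.

```python
def consensus_snps(qc_pass_by_population):
    """Return the SNP IDs that passed QC in every population."""
    sets = [set(ids) for ids in qc_pass_by_population.values()]
    return set.intersection(*sets) if sets else set()


# Toy input with made-up rsIDs, for illustration only.
toy = {
    "CEU": {"rs1", "rs2", "rs3"},
    "YRI": {"rs2", "rs3", "rs4"},
    "JPT": {"rs2", "rs3"},
}
toy_consensus = consensus_snps(toy)

# Counts quoted in this release.
CONSENSUS_QC_SNPS = 1_472_130
MONOMORPHIC_SNPS = 31_514
CONSENSUS_POLYMORPHIC_SNPS = 1_440_616

# Removing the monomorphic SNPs gives the polymorphic consensus set.
assert CONSENSUS_QC_SNPS - MONOMORPHIC_SNPS == CONSENSUS_POLYMORPHIC_SNPS
# Both breakdowns of the individuals add up to the 1184 samples.
assert 589 + 595 == 1184
assert 988 + 196 == 1184
```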

      In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.

      B. PCR RESEQUENCING DATA

      The sequence-based variant calls were generated by tiling PCR primer sets spaced approximately 800 bases apart across the ENCODE 3 regions. After low-quality reads were filtered out, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness across samples, or with too many conflicting genotype calls between the two strands.

      In the QC+ data set, we filtered out samples with low completeness, SNPs with a low call rate (<80%) in each population, and SNPs not in Hardy-Weinberg equilibrium (HWE; p<0.001). In the QC+ data set, the overall false positive rate is ~3.2%, based on a limited number of validation assays.
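
To make the two per-population SNP thresholds concrete, here is a minimal sketch. The release does not say which HWE test was used; a chi-square goodness-of-fit test with one degree of freedom is used here purely as a stand-in, and all function names are hypothetical.

```python
import math


def call_rate(n_called, n_samples):
    """Fraction of samples with a non-missing genotype call."""
    return n_called / n_samples


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(n_aa, n_ab, n_bb, n_samples):
    """Keep a SNP if its call rate is at least 80% and HWE p is at least 0.001."""
    n_called = n_aa + n_ab + n_bb
    return (call_rate(n_called, n_samples) >= 0.80
            and hwe_chi2_p(n_aa, n_ab, n_bb) >= 0.001)
```

With genotype counts of 25, 50 and 25 in 100 samples the SNP is kept; with 50, 0 and 50 (no heterozygotes at all) it fails the HWE threshold.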

      CAVEATS IN THIS RELEASE

      A. SNP GENOTYPE DATA

      • Missing from this release are Illumina SNPs that are A/T or C/G due to strandedness issues (except those that were also typed on the Affymetrix platform, where we were able to confirm strand orientation).
      • Missing from this release are Illumina SNPs that are mitochondrial (as they do not have rsIDs).
      • There may be a few remaining SNPs in this release that are still on the (-/rev) strand of NCBI build 36. We are continuing to work out the strand orientation of all SNPs; the results will be released as release 3.
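
The strand caveats above come down to base complementation: an A/T or C/G SNP reads the same on both strands, so its orientation cannot be inferred from the alleles alone. The following sketch, with hypothetical helper names, shows the check and the flip to the opposite strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Express a tuple of alleles on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT[allele1] == allele2


# An A/G SNP on the (-/rev) strand becomes T/C on the (+/fwd) strand,
# so a strand mismatch is detectable from the alleles.
assert flip_strand(("A", "G")) == ("T", "C")
# An A/T SNP flips to T/A, the same pair of alleles, so the strand
# cannot be determined from the alleles alone.
assert set(flip_strand(("A", "T"))) == {"A", "T"}
```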

      B. PCR RESEQUENCING DATA

      Not all variant calls have been validated yet: we estimate that there is currently a false positive rate of ~12% among all calls, with a slightly higher rate (~14%) if considering just the singletons. Additional validation is ongoing. PCR sequencing of additional samples (MKK) is also ongoing.

      HOW TO DOWNLOAD THIS RELEASE

      A. SNP GENOTYPE DATA

      B. PCR RESEQUENCING DATA

      To access the ENCODE 3 PCR resequencing data, please visit the BCM-HGSC public ftp site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode.

      ANALYSIS PLANS

      Listed below are the analysis plans that we are currently pursuing:

      • SNP allele frequency estimation
      • Population differentiation
      • Linkage disequilibrium analysis
      • SNP tagging
      • Imputation efficiency
      • Genomic locations of human CNVs
      • Genotypes for CNVs
      • Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
      • Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
      • Linkage disequilibrium properties of CNVs
      • Tagging and imputation of CNVs
      • Signals of selection around CNVs
      • Association of SNPs and CNVs with expression phenotypes

      DATA RELEASE POLICY

      The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. An NHGRI policy statement based on the outcome of the meeting is on the NHGRI web site (http://www.genome.gov/10506537).

      The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so will also ensure that the data generated are fully described.

      LINKS