Phasing of many thousands of genotyped samples.

Am J Hum Genet

Authors	Amy Williams, Nick Patterson, Joseph Glessner, Hakon Hakonarson, David Reich
Keywords	Humans, Computational Biology, Algorithms, Internet, Haplotypes, Software, Genetics, Sample Size, Research Design
Abstract	Haplotypes are an important resource for a large number of applications in human genetics, but computationally inferred haplotypes are subject to switch errors that decrease their utility. The accuracy of computationally inferred haplotypes increases with sample size, and although ever larger genotypic data sets are being generated, the fact that existing methods require substantial computational resources limits their applicability to data sets containing tens or hundreds of thousands of samples. Here, we present HAPI-UR (haplotype inference for unrelated samples), an algorithm that is designed to handle unrelated and/or trio and duo family data, that has accuracy comparable to or greater than existing methods, and that is computationally efficient and can be applied to 100,000 samples or more. We use HAPI-UR to phase a data set with 58,207 samples and show that it achieves practical runtime and that switch errors decrease with sample size even with the use of samples from multiple ethnicities. Using a data set with 16,353 samples, we compare HAPI-UR to Beagle, MaCH, IMPUTE2, and SHAPEIT and show that HAPI-UR runs 18× faster than all methods and has a lower switch-error rate than do other methods except for Beagle; with the use of consensus phasing, running HAPI-UR three times gives a slightly lower switch-error rate than Beagle does and is more than six times faster. We demonstrate results similar to those from Beagle on another data set with a higher marker density. Lastly, we show that HAPI-UR has better runtime scaling properties than does Beagle so that for larger data sets, HAPI-UR will be practical and will have an even larger runtime advantage. HAPI-UR is available online (see Web Resources).
Year of Publication	2012
Journal	Am J Hum Genet
Volume	91
Issue	2
Pages	238-51
Date Published	2012 Aug 10
ISSN	1537-6605
URL	http://linkinghub.elsevier.com/retrieve/pii/S0002-9297(12)00322-9
DOI	10.1016/j.ajhg.2012.06.013
PubMed ID	22883141
PubMed Central ID	PMC3415548
Grant list	R01 GM100233 / GM / NIGMS NIH HHS / United States; F32 HG005944 / HG / NHGRI NIH HHS / United States; 076113 / Wellcome Trust / United Kingdom; 085475 / Wellcome Trust / United Kingdom; F32HG005944 / HG / NHGRI NIH HHS / United States
