Neale Lab, Broad Institute

Principal component analysis (PCA) is the standard method for estimating population structure and sample ancestry in genetic datasets. Population structure can induce confounding in genome-wide association studies (GWAS), which is typically addressed by including principal components (PCs) as covariates. However, results from random matrix theory (RMT) predict that PCA fails to detect population differentiation below a particular threshold and that, even above the threshold, sample PCs may be only partially correlated with the true axes of differentiation. For each PC, these phenomena depend on the corresponding eigenvalue; we extend previous work to characterize and interpret the eigenvalues for general population structures. Moreover, we propose an estimator for the effective number of unlinked variants that outperforms previous moments-based estimators. We then combine this estimator with RMT results to estimate the inaccuracy of each PC and to predict how this inaccuracy leads to residual confounding in GWAS on stratified phenotypes. We validate our method via downsampling experiments on real data, including the UK Biobank, and suggest that this behavior may be driving the uncorrected stratification recently observed in some large meta-analyses of smaller GWAS.
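
To make the threshold behavior concrete, the sketch below evaluates the textbook limits for a single-spike covariance model (the Baik-Ben Arous-Péché transition). It is an illustration of the general RMT phenomenon under stated assumptions, not the estimator or the general-structure results from this talk. The function names, the unit noise variance, and the aspect ratio `gamma` (dimension divided by number of observations) are assumptions introduced here; how `gamma` maps onto sample size and the effective number of unlinked variants in a genetic dataset is not specified by the abstract.

```python
import math


def detection_threshold(gamma: float) -> float:
    """Smallest population spike eigenvalue that PCA can detect.

    Assumes a single-spike covariance model with unit noise variance and
    aspect ratio gamma = dimension / number of observations (hypothetical
    parameterization chosen for this sketch).
    """
    return 1.0 + math.sqrt(gamma)


def limiting_sample_eigenvalue(spike: float, gamma: float) -> float:
    """Asymptotic top sample eigenvalue for a population spike eigenvalue."""
    if spike <= detection_threshold(gamma):
        # Below the threshold the top eigenvalue sticks to the bulk edge.
        return (1.0 + math.sqrt(gamma)) ** 2
    return spike * (1.0 + gamma / (spike - 1.0))


def limiting_squared_cosine(spike: float, gamma: float) -> float:
    """Asymptotic squared cosine between sample and true eigenvector."""
    if spike <= detection_threshold(gamma):
        # Below the threshold the sample PC carries no information.
        return 0.0
    return (1.0 - gamma / (spike - 1.0) ** 2) / (1.0 + gamma / (spike - 1.0))


if __name__ == "__main__":
    gamma = 0.25
    for spike in (1.2, 1.5, 2.0, 5.0):
        print(
            spike,
            limiting_sample_eigenvalue(spike, gamma),
            limiting_squared_cosine(spike, gamma),
        )
```

With `gamma = 0.25` the threshold is 1.5. A spike of 1.2 is undetectable, with squared cosine 0, while a spike of 2.0 yields a limiting sample eigenvalue of 2.5 and a squared cosine of 0.6. In this toy model, a PC above the threshold is still only partially aligned with the true axis, which is the kind of inaccuracy the abstract describes.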
