Hundreds of proteins encoded in human genomes contain domains that vary in size or copy number due to variable numbers of tandem repeats (VNTRs) in exons. VNTRs have eluded analysis by the SNP arrays and high-throughput sequencing used in large-scale genetic studies to date; thus, the relationships of VNTRs to most phenotypes are unknown. To address this need, we developed methods to estimate VNTR allele lengths from whole-exome sequencing data, and to impute VNTR lengths into SNP array data. For sequenced individuals, we estimated VNTR length (summed across the two parentally derived alleles) from depth-of-coverage; we then used genotypes for surrounding SNPs to identify shared haplotypes, allowing us to refine these measurements and infer the length of individual alleles. The resulting SNP and VNTR haplotypes served as a reference for imputation into SNP array data. We analyzed 118 protein-altering VNTRs in 415,280 UK Biobank participants for association with 791 phenotypes.
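The depth-of-coverage step described above can be illustrated with a minimal sketch. It is not the estimator used in this work. It assumes that reads from every repeat unit align to the reference copy of the VNTR, so that mean depth inside the repeat, relative to mean depth in flanking single-copy sequence, scales with the summed repeat count of the two alleles. The function name and all three parameters are hypothetical, and the sketch omits the haplotype-based refinement and any correction for GC content or capture bias.

```python
def estimate_diploid_repeat_units(vntr_depth, flank_depth, ref_repeat_units):
    """Estimate the repeat-unit count summed across both alleles.

    vntr_depth       -- mean per-base read depth inside the VNTR (hypothetical input)
    flank_depth      -- mean per-base read depth in flanking diploid single-copy sequence
    ref_repeat_units -- number of repeat units in the reference genome

    Assumption: a person carrying the reference allele on both chromosomes
    shows a depth ratio of 1, which corresponds to 2 * ref_repeat_units.
    """
    if flank_depth <= 0:
        raise ValueError("flank_depth must be positive")
    if vntr_depth < 0:
        raise ValueError("vntr_depth must be non-negative")
    if ref_repeat_units <= 0:
        raise ValueError("ref_repeat_units must be positive")

    depth_ratio = vntr_depth / flank_depth
    return 2 * ref_repeat_units * depth_ratio


# Depth 1.5x that of the flanks, with 10 reference units: 30 units in total.
print(estimate_diploid_repeat_units(60.0, 40.0, 10))  # 30.0
```

The result is a diploid sum only. Splitting it into two allele lengths requires the shared-haplotype information described above.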
These analyses uncovered 15 associations, involving 5 distinct VNTRs, for which fine-mapping analyses (accounting for nearby variation) indicated that the VNTR was likely to be causal. These VNTRs appeared to explain some of the largest GWAS signals for diverse phenotypes, including height, hair morphology, and serum urea. For example, a VNTR encoding a 19-amino-acid repeat in aggrecan (ACAN) was associated with height variation of 3.4 centimeters.
At LPA, our approach enabled a deep fine-mapping analysis that explained ~89% of the heritability of lipoprotein(a) (Lp(a)) concentration, a major cardiovascular risk factor. A 5.5-kb repeat spanning two exons of LPA is known to explain roughly half of Lp(a) variance. To map the remaining genetic variance, we leveraged 31,421 individuals who were heterozygous for a null Lp(a) variant, allowing us to analyze the effects of variants on Lp(a) levels in an effectively haploid model. This analysis identified 14 coding and splice variants affecting 43% of European LPA alleles that appeared to drastically decrease Lp(a) independent of VNTR length. Population differences in the frequencies of these variants and VNTR alleles explained the 4-fold variation in median Lp(a) across populations.
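The logic of the haploid analysis can be sketched as follows. In a person heterozygous for a null allele, measured Lp(a) reflects only the one functional allele, so a variant on that allele can be assessed without interference from the other chromosome. The sketch is illustrative and is not the analysis performed here. The function name and record layout are hypothetical, phase is assumed to be known, and no adjustment is made for VNTR length or other covariates.

```python
from statistics import mean


def haploid_variant_effect(records):
    """Mean Lp(a) difference, carriers minus non-carriers.

    records -- iterable of (lpa_level, carries_variant) pairs, one per
               null-heterozygous individual. carries_variant is True when
               the candidate variant sits on the functional (non-null)
               allele. Both fields are hypothetical inputs.
    """
    records = list(records)
    carriers = [level for level, carries in records if carries]
    noncarriers = [level for level, carries in records if not carries]
    if not carriers or not noncarriers:
        raise ValueError("need at least one carrier and one non-carrier")
    return mean(carriers) - mean(noncarriers)


# Toy values for illustration only; these are not study data.
toy = [(10.0, True), (20.0, True), (50.0, False), (70.0, False)]
print(haploid_variant_effect(toy))  # -45.0
```

A negative value indicates lower Lp(a) among carriers. A fuller model would condition on the VNTR length of the functional allele.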
These results point to strong, cryptic effects of highly polymorphic common structural variants that have largely eluded molecular analyses to date.