Simulating, storing and processing genetic variation data for millions of samples

Jerome Kelleher
Wellcome Trust Centre for Human Genetics, Oxford

Abstract: Coalescent theory has played a key role in modern population genetics and is fundamental to our understanding of genetic variation. While simulation has been essential to coalescent theory from its beginnings, simulating realistic population-scale genome-wide data sets under the exact model was, until recently, considered infeasible. Even under an approximate model, simulating more than a few tens of thousands of samples was very time consuming: a single replicate could take several weeks to complete. However, by encoding simulated genealogies using a new data structure (called a tree sequence), we can now simulate entire chromosomes for millions of samples under the exact coalescent model in a few hours. We discuss some applications that these simulations have made possible, including a study of biases in human GWAS and the systematic benchmarking of variant processing tools at scale. The tree sequence data structure is also an extremely concise way of representing genetic variation data, and we show how variant data for millions of simulated human samples can be stored in only a few gigabytes. Moreover, we show that this very high level of compression does not incur a decompression cost. Because the information is represented in terms of the underlying genealogies, operations such as computing allele frequencies on sample subsets or measuring linkage disequilibrium can be made very efficient. Finally, we discuss ongoing work on inferring tree sequences from observed data and present some preliminary results.
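To illustrate why a genealogical representation makes allele frequencies on sample subsets cheap to compute, here is a minimal sketch in plain Python. It is a toy, not the implementation discussed in the talk: it handles a single tree rather than a full tree sequence, and the names `count_samples_below` and `allele_frequencies`, the parent-array encoding, and the example data are all assumptions made for illustration. It also assumes each site carries exactly one mutation, so the derived allele is inherited by exactly the samples below the mutation's node.

```python
from typing import Dict, List, Set


def count_samples_below(parent: List[int], subset: Set[int]) -> List[int]:
    """Count, for every node, how many samples in `subset` descend from it.

    parent[u] is the parent of node u, or -1 for the root. Assumption:
    every child has a smaller id than its parent, so a single ascending
    pass pushes each count up the tree after it is complete.
    """
    counts = [0] * len(parent)
    for u in subset:
        counts[u] = 1
    for u in range(len(parent)):
        p = parent[u]
        if p != -1:
            counts[p] += counts[u]
    return counts


def allele_frequencies(
    parent: List[int], mutation_node: Dict[str, int], subset: Set[int]
) -> Dict[str, float]:
    """Derived-allele frequency of each site within `subset`.

    mutation_node maps a (hypothetical) site name to the node directly
    above which its single mutation occurred.
    """
    counts = count_samples_below(parent, subset)
    return {site: counts[node] / len(subset) for site, node in mutation_node.items()}


# Toy tree: samples 0-3; node 4 joins (0, 1), node 5 joins (2, 3), node 6 is the root.
parent = [4, 4, 5, 5, 6, 6, -1]
mutation_node = {"A": 4, "B": 2, "C": 5}

freqs = allele_frequencies(parent, mutation_node, subset={0, 1, 2})
```

The cost is one pass over the nodes plus one lookup per site, rather than a scan of a full samples-by-sites genotype matrix. Changing the subset only requires recomputing the per-node counts.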