Succinct tree sequences for megasample genomics / Introduction to the tree sequence toolchain

Jerome Kelleher
BDI, Oxford
Succinct tree sequences for megasample genomics

Abstract: The era of megasample genomics, where datasets routinely contain millions of sampled genomes, is upon us. Present-day computational methods are fundamentally organised around the variant matrix, where each row describes the observations for every sample at a given genomic location. At megasample scale, such matrices are unwieldy and cannot be processed without complex parallel algorithms. We show that the recently introduced succinct tree sequence data structure has the potential to greatly reduce storage and processing costs; that it directly encodes important biological signals; and that it has led to efficiency gains of several orders of magnitude in simulation and whole-genome ancestry inference. We examine the underlying algorithmic properties of tree sequences that enable these performance gains, and also discuss a preliminary algorithm for exactly solving the Li and Stephens model in logarithmic time.
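
To make the contrast with the variant matrix concrete, here is a minimal pure-Python sketch of the underlying idea. It is a toy under stated assumptions, not tskit's actual encoding or algorithms: it uses a single tree over four samples, assumes each site has exactly one mutation with no back mutations, and all names (parent, mutation_node, genotype_row) are hypothetical. The point is only that one row of the variant matrix can be recovered from a tree plus the node above which the mutation occurred.

```python
# Toy tree over four samples (nodes 0-3); parent[u] is the parent of node u,
# and -1 marks the root. Nodes 4, 5 and 6 are ancestral (non-sample) nodes.
parent = [4, 4, 5, 5, 6, 6, -1]
samples = [0, 1, 2, 3]

# Assumption: one mutation per site, stored only as the node it sits above.
mutation_node = {"site_a": 4, "site_b": 2, "site_c": 5}


def carries(sample, node):
    """Return True if `node` lies on the path from `sample` up to the root."""
    u = sample
    while u != -1:
        if u == node:
            return True
        u = parent[u]
    return False


def genotype_row(node):
    """Rebuild one variant-matrix row: 1 for samples that inherit the mutation."""
    return [int(carries(s, node)) for s in samples]


# The full (tiny) variant matrix, derived from the tree and three node IDs.
variant_matrix = {site: genotype_row(node) for site, node in mutation_node.items()}

for site, row in variant_matrix.items():
    print(site, row)
```

In this toy, each site costs one stored node ID regardless of the number of samples, whereas the corresponding matrix row grows with the sample count.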

Wilder Wohns
Oxford Stats
Primer: Introduction to the tree sequence toolchain

Abstract: The succinct tree sequence data structure is a concise and efficient encoding of whole-genome ancestry and sequence data, with a rapidly maturing software ecosystem. The tskit (tree sequence toolkit) library provides a comprehensive framework for working with tree sequences using Python and C APIs. The ecosystem growing around this central technology now includes several genome simulators as well as tsinfer, our highly scalable method for inferring ancestry from data. In this primer session, we will introduce tskit and the tree sequence data structure, and demonstrate both the simulation and inference of genomic datasets in real time using downloadable Jupyter Notebooks.