The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive of their cell essentiality, capable of identifying

"Double dipping" is the practice of using the same data to fit and validate a model. Problems typically arise when standard statistical procedures are applied in settings involving double dipping. To avoid the challenges surrounding double dipping, a natural approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in some problems, sample splitting is unattractive or impossible. In this talk, we are motivated by unsupervised problems that

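As an illustration of the sample-splitting idea described in the abstract above, here is a minimal sketch under stated assumptions; it is not taken from the talk. The simulated data are pure noise, and the function names, sample sizes, and the choice of "pick the feature most correlated with the response" as the fitting step are all hypothetical. Selecting and evaluating on the same samples inflates the apparent correlation, while selecting on one half and evaluating on the other does not.

```python
import random
import statistics


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def compare_double_dipping_to_splitting(n=100, p=50, reps=100, seed=0):
    """Average |correlation| of the selected feature, with and without splitting.

    Every feature is independent of the response (hypothetical null data),
    so any large correlation is an artifact of the selection step.
    """
    rng = random.Random(seed)
    half = n // 2
    dipped, split = [], []
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

        # Double dipping: select and evaluate on the same n samples.
        best = max(features, key=lambda f: abs(pearson(f, y)))
        dipped.append(abs(pearson(best, y)))

        # Sample splitting: select on the first half, evaluate on the second.
        chosen = max(features, key=lambda f: abs(pearson(f[:half], y[:half])))
        split.append(abs(pearson(chosen[half:], y[half:])))
    return statistics.fmean(dipped), statistics.fmean(split)


dipped_mean, split_mean = compare_double_dipping_to_splitting()
print(f"mean |r|, same data for selection and evaluation: {dipped_mean:.3f}")
print(f"mean |r|, selection and evaluation on separate halves: {split_mean:.3f}")
```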
Today’s genomics workflows typically begin by aligning sequencing data to a reference. In addition to being slow, this has many statistical drawbacks. Even in the intensely studied human genome, it was found that understudied populations have large amounts of sequence missing from the current reference; such blind spots may exacerbate health disparities. Reference-based methods are additionally limited in their detection of novel biology: reads from unannotated isoforms may be mismapped or discarded completely. In recent work we introduce a unifying paradigm, SPLASH, which directly analyzes

Myriad mechanisms diversify the sequence content of DNA and of RNA transcripts. Currently, these events are detected using tools that, as a first step, require alignment to a necessarily incomplete reference genome; this incompleteness is especially prominent in human genetic diseases such as cancer, in the microbial world, and in non-model organisms, where it severely limits the speed and scope of discovery. Second, the next step in analysis requires a custom choice of bioinformatic procedure: for example, to detect splicing, RNA editing, or V(D)J

Modelling cell state-dependent genetic associations with single-cell gene expression poses statistical and computational challenges. First, parametrization of single-cell gene expression profiles is not a straightforward task because individual genes exhibit distinct distributions. Second, current single-cell datasets consist of hundreds of thousands to millions of cells, which constrains the ability to test associations in a scalable manner. In this talk, I will introduce a new generalizable approach to robustly identify cell state-dependent eQTLs in single-cell data. To overcome the

As single-cell RNA-seq datasets grow larger and more complex, they enable richer analyses of how gene expression varies between cells and people. However, methods designed for bulk data fail to account for the unique structure of single-cell gene expression. Researchers are now developing statistical models tailored to single-cell-resolution data for a variety of applications. In this primer, I will focus on single-cell models for the task of mapping expression quantitative trait loci (eQTLs) to find genetic variants associated with a gene's expression. Single-cell eQTL models have the

This primer talk is motivated by the practice of testing data-driven hypotheses. In the biomedical sciences, it has become increasingly common to collect massive datasets without a pre-specified research question. In this setting, a data analyst might use the data both to generate a research question, and to test the associated null hypothesis. For example, in single-cell RNA-sequencing analyses, researchers often first cluster the cells, and then test for differences in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell
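The cluster-then-test workflow described above can be simulated in a few lines. The following is a minimal sketch under stated assumptions, not the speaker's method: a single hypothetical "gene" is drawn from one normal distribution (so there are no true groups), a simple one-dimensional 2-means routine stands in for the clustering step, and a Welch t-statistic is compared against the normal critical value 1.96 as an approximation to a 5% two-sided test. All function names and parameters are illustrative.

```python
import random
import statistics


def two_means_1d(values, iters=20):
    """Lloyd's algorithm with two centers on one-dimensional data."""
    center0, center1 = min(values), max(values)
    for _ in range(iters):
        group0 = [v for v in values if abs(v - center0) <= abs(v - center1)]
        group1 = [v for v in values if abs(v - center0) > abs(v - center1)]
        center0, center1 = statistics.fmean(group0), statistics.fmean(group1)
    return group0, group1


def welch_t(a, b):
    """Welch two-sample t-statistic (unequal variances)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def rejection_rates(n=100, reps=200, seed=0, critical=1.96):
    """Fraction of null datasets in which the test rejects.

    'clustered' tests groups found by clustering the same data;
    'arbitrary' tests groups fixed in advance (first half vs second half).
    """
    rng = random.Random(seed)
    clustered = arbitrary = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]  # one population, no real clusters
        g0, g1 = two_means_1d(x)
        clustered += abs(welch_t(g0, g1)) > critical
        arbitrary += abs(welch_t(x[: n // 2], x[n // 2:])) > critical
    return clustered / reps, arbitrary / reps


clustered_rate, arbitrary_rate = rejection_rates()
print(f"rejection rate, groups defined by clustering: {clustered_rate:.2f}")
print(f"rejection rate, groups defined in advance:    {arbitrary_rate:.2f}")
```

Because the clusters are chosen to separate the observed values, the test that reuses the same data rejects far more often than the nominal level, even though no difference exists; the test on groups fixed in advance stays close to it.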