The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive of their cell essentiality, capable of identifying

"Double dipping" is the practice of using the same data to fit and validate a model. Problems typically arise when standard statistical procedures are applied in settings involving double dipping. To avoid the challenges surrounding double dipping, a natural approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in some problems, sample splitting is unattractive or impossible. In this talk, we are motivated by unsupervised problems that

Today’s genomics workflows typically begin by aligning sequencing data to a reference. In addition to being slow, this has many statistical drawbacks. Even in the intensely studied human genome, it was found that understudied populations have large amounts of sequence missing from the current reference; such blind spots may exacerbate health disparities. Reference-based methods are additionally limited in their detection of novel biology: reads from unannotated isoforms may be mismapped or discarded completely. In recent work we introduce a unifying paradigm, SPLASH, which directly analyses

Myriad mechanisms diversify the sequence content of DNA and of RNA transcripts. Currently, these events are detected using tools that first require alignment to a necessarily incomplete reference genome; this incompleteness is especially prominent in human genetic diseases such as cancer, in the microbial world, and in non-model organisms, where it severely limits the speed and scope of discovery. Second, the next step in analysis today requires a custom choice of bioinformatic procedure: for example, to detect splicing, RNA editing, or V(D)J