Statistical and algorithmic challenges in reference-free analysis
Postdoctoral fellow, Eric and Wendy Schmidt Center, Broad Institute
Today’s genomics workflows typically begin by aligning sequencing data to a reference genome. Alignment is slow, and it also has many statistical drawbacks. Even for the intensely studied human genome, large amounts of sequence found in understudied populations are missing from the current reference; such blind spots may exacerbate health disparities. Reference-based methods are also limited in their ability to detect novel biology: reads from unannotated isoforms may be mismapped or discarded completely. In recent work, we introduced SPLASH, a unifying paradigm that analyzes raw sequencing data directly, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and runs efficiently at scale, providing a statistical approach to genomic analysis that enables expansive discovery without metadata or references. In this primer, I’ll discuss some of the challenges of reference-free analysis and provide the algorithmic and statistical background for our proposed solution, SPLASH.
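To make "sample-specific sequence variation" concrete, here is a minimal sketch of one way to test for it directly on raw reads. It is not SPLASH's statistic or implementation. It assumes an anchor/target formulation: for a fixed anchor k-mer, tabulate the k-mers that follow it (targets) in each sample, then ask whether the target distribution depends on the sample. The test used here is a stand-in, a Pearson chi-square statistic with a permutation p-value. The function names, the `gap` parameter, and the toy reads are all hypothetical.

```python
import random
from collections import Counter


def target_counts(reads_by_sample, anchor, k_target, gap=0):
    """For each sample, count the k-mers found `gap` bases after each anchor hit."""
    counts = {}
    for sample, reads in reads_by_sample.items():
        c = Counter()
        for read in reads:
            start = read.find(anchor)
            while start != -1:
                t0 = start + len(anchor) + gap
                target = read[t0:t0 + k_target]
                if len(target) == k_target:  # skip targets cut off by the read end
                    c[target] += 1
                start = read.find(anchor, start + 1)
        counts[sample] = c
    return counts


def chi_square(labels, targets):
    """Pearson chi-square statistic of the sample-by-target contingency table."""
    n = len(labels)
    observed = Counter(zip(labels, targets))
    row, col = Counter(labels), Counter(targets)
    stat = 0.0
    for s in row:
        for t in col:
            expected = row[s] * col[t] / n
            stat += (observed[(s, t)] - expected) ** 2 / expected
    return stat


def permutation_pvalue(counts, n_perm=2000, seed=0):
    """P-value for 'targets are independent of sample', by shuffling sample labels."""
    labels, targets = [], []
    for sample, c in counts.items():
        for target, m in c.items():
            labels.extend([sample] * m)
            targets.extend([target] * m)
    observed = chi_square(labels, targets)
    rng = random.Random(seed)
    shuffled = labels[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point summation order.
        if chi_square(shuffled, targets) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): the same anchor is followed by a different
# target in each sample, the pattern a reference-free test should flag.
reads = {
    "sample_A": ["ACGTACGGTTT"] * 10,
    "sample_B": ["ACGTATTCAAA"] * 10,
}
counts = target_counts(reads, anchor="ACGTA", k_target=3)
p_value = permutation_pvalue(counts)
```

No reference genome or sample metadata enters the computation; the only inputs are the reads and which sample each came from. A permutation test is simple to state but costly when applied to every anchor in a large dataset, which is part of why scalable reference-free analysis poses its own algorithmic and statistical challenges.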