
Dylan Agyemang

Dylan, a junior studying Applied Mathematics and Statistics at the University of North Carolina at Chapel Hill, leveraged statistical modeling to improve the accuracy and efficiency of RNA-sequencing gene count estimation.
Before joining BSRP, I expected to work alongside brilliant scientists and innovators, and that expectation was fully met. However, I was pleasantly surprised by the welcoming and caring nature of the BSRP community. The mentorship and unwavering support I received created an environment where I could thrive both personally and professionally. This summer not only allowed me to apply advanced research techniques, but it also gifted me with meaningful connections and relationships that I will cherish for a lifetime.
Droplet-based RNA sequencing (RNA-seq) is a powerful and widely used technique for quantifying RNA molecules in biological samples, providing detailed insights into gene expression at the single-cell level. A crucial component of RNA-seq analysis is the use of Unique Molecular Identifiers (UMIs), which are attached to RNA molecules before sequencing. UMIs are essential for mitigating PCR amplification biases through deduplication, ensuring accurate and reliable quantification of gene expression. Despite their importance, the tradeoff between UMI length (complexity) and gene expression accuracy is poorly understood, and there is no consensus on the optimal UMI length across sequencing protocols. To address this issue, we develop a statistical model that relates true gene counts to those computed using length-k UMIs. Our model accounts for collisions between UMIs and can be extended to account for the non-uniform distribution of UMIs and gene-specific variation. By inverting the relationship between true gene counts and those obtained from shorter UMIs, we aim to provide accurate gene count estimation from shorter, cheaper UMIs that require less sequencing. The goal of this work is to develop an improved RNA-seq quantification pipeline that yields similar or improved accuracy with simplified genomic kits and reduced sequencing costs.
By refining RNA-seq techniques and improving our understanding of how gene expression dynamics are influenced by technical parameters such as UMI length, our work contributes to more precise and cost-effective RNA-seq analysis.
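As a rough illustration of the collision idea in the abstract above, the sketch below uses the textbook occupancy calculation rather than the project's actual model. It assumes that UMIs are drawn uniformly at random from all 4^k sequences of length k and that each gene is handled independently; the function names are hypothetical. Under those assumptions, n true molecules are expected to yield K(1 - (1 - 1/K)^n) distinct UMIs with K = 4^k, and that relationship can be inverted to recover an estimate of n from the deduplicated count.

```python
import math


def expected_unique_umis(n_molecules: int, k: int) -> float:
    """Expected number of distinct UMIs seen for n_molecules molecules.

    Assumption (not from the source): each molecule gets a UMI drawn
    uniformly at random from the K = 4**k possible length-k sequences.
    """
    num_umis = 4 ** k
    return num_umis * (1.0 - (1.0 - 1.0 / num_umis) ** n_molecules)


def estimate_true_count(observed_unique: float, k: int) -> float:
    """Invert the occupancy formula to estimate the true molecule count.

    Solves observed = K * (1 - (1 - 1/K)**n) for n. The estimate is
    undefined when every possible UMI has been observed.
    """
    num_umis = 4 ** k
    if not 0 <= observed_unique < num_umis:
        raise ValueError("observed_unique must be in [0, 4**k)")
    # log1p keeps precision when 1/K is very small (long UMIs).
    return math.log1p(-observed_unique / num_umis) / math.log1p(-1.0 / num_umis)
```

With short UMIs (small k), the deduplicated count falls noticeably below the true count because distinct molecules share a UMI; the inversion corrects for that expected shortfall. Non-uniform UMI usage and gene-specific variation, which the abstract mentions as extensions, are not covered by this sketch.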
Project: Enhancing Single Cell RNA Sequencing Accuracy: The Role of UMI Length
Mentor: Tavor Baharav, Schmidt Center