Data thinning to avoid double dipping

Assistant Professor of Statistics

University of British Columbia

"Double dipping" is the practice of using the same data to fit and validate a model. Problems typically arise when standard statistical procedures are applied in settings involving double dipping. To avoid the challenges surrounding double dipping, a natural approach is to fit a model on one dataset, and then validate the model on another independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in some problems, sample splitting is unattractive or impossible. In this talk, we are motivated by unsupervised problems that arise in the analysis of single-cell RNA sequencing data, where sample splitting does not allow us to avoid double dipping. We first propose Poisson thinning, which splits a single observation drawn from a Poisson distribution into two independent pseudo-observations. We show that Poisson thinning allows us to avoid double dipping in unsupervised settings. We next generalize the Poisson thinning framework to a variety of distributions, and refer to this general framework as "data thinning". Data thinning is applicable far beyond the context of single-cell RNA sequencing data, and is particularly useful for problems where sample splitting is unattractive or impossible.
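As a rough illustration of the Poisson thinning step described above, the sketch below uses the standard binomial-thinning property of the Poisson distribution: if X ~ Poisson(lambda) and X1 given X is Binomial(X, eps), then X1 ~ Poisson(eps * lambda) and X2 = X - X1 ~ Poisson((1 - eps) * lambda), with X1 and X2 independent. The abstract does not spell out the construction, so treating it as binomial thinning is an assumption here, and the function name poisson_thin, the parameter eps, and the simulated counts are all hypothetical names chosen for illustration.

```python
import numpy as np


def poisson_thin(x, eps=0.5, rng=None):
    """Split Poisson counts x into two pseudo-observations (hypothetical helper).

    Assumes each entry of x is a draw from a Poisson distribution.
    eps is the fraction of information assigned to the first part.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # Each of the x events is assigned to the first part with probability eps.
    x1 = rng.binomial(x, eps)
    # The remainder forms the second part, so x1 + x2 == x exactly.
    x2 = x - x1
    return x1, x2


# Simulated counts stand in for real data (an assumption for this example).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=100_000)
train, test = poisson_thin(counts, eps=0.5, rng=rng)

# The two parts should be (empirically) close to uncorrelated.
print(np.corrcoef(train, test)[0, 1])
```

In a workflow of this kind, one part (train here) would be used to fit the model, for example to estimate clusters, and the other part (test) to validate it. Both parts keep every observation, which is what distinguishes this from splitting the samples themselves.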
