Linear models are used across the sciences and social sciences to understand how outcomes of interest depend on a large collection of covariates. When datasets are noisy, or the number of data points is small relative to the number of covariates, hierarchical Bayesian modeling has the potential to improve analyses by “sharing statistical strength” across multiple groups of data. For example, one might use a linear model to associate aspects of community-level demography with historical rates of law enforcement and want to share strength across communities in different geographic regions. A standard Bayesian approach is to model effects as exchangeable across groups but to leverage correlation in effects across covariates within each group. However, we show that this approach exhibits poor statistical performance when the number of covariates exceeds the number of groups. For instance, in statistical genetics, we might regress dozens of traits (defining groups) for thousands of individuals (data points) on up to millions of genetic variants (covariates). Taking inspiration from earlier work in statistical genetics, we argue that when an analyst has more covariates than groups, it is preferable instead to model the effects for each covariate as exchangeable across covariates and leverage correlation across groups. To this end, we propose a new hierarchical model expressing our alternative perspective, and show with theory and simulation that this model produces more accurate estimates than the standard one when the number of covariates dominates the number of groups. We corroborate this result with improved performance in several hierarchical regression and classification applications.
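To make the two exchangeability assumptions concrete, here is a minimal numpy sketch (not from the paper; the covariance matrices and dimensions are illustrative placeholders, and in practice the covariances would be inferred from data). The standard prior treats each group's effect vector as an exchangeable draw governed by a D×D covariance over covariates; the alternative treats each covariate's effect vector as an exchangeable draw governed by a G×G covariance over groups. When D is much larger than G, the alternative asks for a small G×G matrix informed by D exchangeable draws, rather than a huge D×D matrix informed by only G draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_covariates = 5, 1000  # covariates >> groups, as in the genetics example

# Standard prior: effect vectors are exchangeable ACROSS GROUPS; a single
# D x D covariance captures correlation across covariates within each group.
# This matrix has O(D^2) free parameters but only n_groups exchangeable draws.
cov_across_covariates = np.eye(n_covariates)  # placeholder; inferred in practice
effects_standard = rng.multivariate_normal(
    np.zeros(n_covariates), cov_across_covariates, size=n_groups
)  # shape (n_groups, n_covariates): one row of effects per group

# Alternative prior: effect vectors are exchangeable ACROSS COVARIATES; a
# small G x G covariance captures correlation across groups, informed by
# n_covariates exchangeable draws.
cov_across_groups = 0.5 * np.eye(n_groups) + 0.5  # placeholder equicorrelated prior
effects_alternative = rng.multivariate_normal(
    np.zeros(n_groups), cov_across_groups, size=n_covariates
).T  # transpose to the same (n_groups, n_covariates) layout

print(effects_standard.shape, effects_alternative.shape)
```

Both priors produce an effect matrix of the same shape; they differ only in which axis is treated as exchangeable and which carries the learned correlation.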
This talk is based on a paper with Hilary Finucane and Tamara Broderick appearing in NeurIPS 2021.