Detecting novel associations in large data sets

David Reshef, Yakir Reshef
MIT CSAIL, HMS, Harvard CS

Abstract: As data sets grow in dimensionality, making sense of the wealth of interactions they contain has become a daunting task, not only because of the sheer number of relationships but also because relationships come in different forms (e.g., linear, exponential, periodic) and strengths. If you do not already know what kinds of relationships might be interesting, how do you find the most important or unanticipated ones effectively and efficiently? This is commonly done by using a statistic to rank relationships in a data set and then manually examining the top of the resulting list. For such a strategy to succeed, though, the statistic must give similar scores to equally noisy relationships of different types. In this talk we will formalize this property, called equitability, and show how it is related to a variety of traditional statistical concepts. We will then introduce the maximal information coefficient, a statistic that has state-of-the-art equitability in a wide range of settings, and discuss how its equitability translates to practical benefits in the search for dependence structure in high-dimensional data, using examples from global health and the human gut microbiome.
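
The rank-then-inspect strategy described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the statistic used here is the absolute Pearson correlation, chosen as a familiar stand-in and not the maximal information coefficient, and the helper names (`pearson`, `rank_pairs`) and the toy data are hypothetical. The toy data contain two noiseless relationships, one linear and one parabolic, to suggest why a statistic that scores equally noisy relationships differently can hide interesting pairs far down the list.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_pairs(columns, statistic):
    """Score every pair of columns and sort from highest to lowest score.

    `columns` maps a variable name to its list of values (hypothetical layout).
    """
    scores = [
        ((a, b), statistic(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: both "linear" and "parabola" are noiseless functions of x.
xs = [i / 10 for i in range(-20, 21)]
data = {
    "x": xs,
    "linear": [2 * v + 1 for v in xs],
    "parabola": [v * v for v in xs],
}

# Stand-in statistic (assumption): absolute Pearson correlation, not MIC.
ranking = rank_pairs(data, lambda a, b: abs(pearson(a, b)))
for pair, score in ranking:
    print(pair, round(score, 3))
```

With this stand-in, the linear pair lands at the top of the list while the equally noiseless parabolic pair scores near zero, which is the kind of mismatch an equitable statistic is meant to avoid.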

David Reshef, Yakir Reshef
MIT CSAIL, HMS, Harvard CS
Primer: Hypothesis testing and measures of dependence

Abstract: Searching for departures from statistical independence in data is a fundamental problem that has been formalized in a variety of ways. We will cover two frameworks in which this problem has historically been understood. The first is statistical and involves framing the search as a hypothesis test in a finite-sample setting. The second is probabilistic and involves defining functions of random variables that have useful properties in the large-sample limit. We will close with a discussion of common themes underlying measures of dependence arising from each of these paradigms.
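
As a rough illustration of the first, finite-sample framework, the sketch below frames the search for dependence as a permutation test of independence. It is an assumption-laden example, not the method presented in the primer: the test statistic (absolute Pearson correlation), the function name `permutation_p_value`, the number of permutations, and the simulated data are all choices made here for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, statistic, n_permutations=999, seed=0):
    """Permutation p-value for the null hypothesis that x and y are independent.

    Shuffling y breaks any pairing with x, so the shuffled statistics
    approximate the statistic's distribution under independence.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


def abs_pearson(a, b):
    return abs(pearson(a, b))


# Simulated data (assumption): one dependent pair and one independent pair.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(50)]
y_dependent = [a + 0.5 * data_rng.gauss(0, 1) for a in x]
y_independent = [data_rng.gauss(0, 1) for _ in x]

p_dependent = permutation_p_value(x, y_dependent, abs_pearson)
p_independent = permutation_p_value(x, y_independent, abs_pearson)
print("dependent pair p-value:", p_dependent)
print("independent pair p-value:", p_independent)
```

Any other measure of dependence could be passed as `statistic`; the permutation scheme itself does not depend on the choice.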