In medical datasets, the most important labels are often the rarest. For example, while it is responsible for more than 450,000 deaths a year in the United States alone, sudden cardiac death (SCD) will likely appear in only a few hundred health records in a hospital dataset of a hundred thousand patients. Furthermore, a binary label, such as an SCD indicator, carries little information about the intricacies of the outcome. In contrast to the rarity and opacity of the labels, the relationships among data within a medical dataset are often plentiful and rich.
Self-supervised learning (SSL) is an approach to training deep learning models that is well matched to these characteristics of medical datasets. We propose Patient Contrastive Learning, an SSL approach that exploits a fundamental relationship in medical datasets: which data come from which patient. We train a Patient Contrastive Learning model on an unlabeled dataset of 3.2 million ECGs and demonstrate substantial improvements over training a neural network from scratch on four downstream tasks. We investigate how performance scales with the amount of training data, both unlabeled and labeled. Finally, we examine recent theoretical developments on how to choose an SSL task for a given downstream task.
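The patient-pairing idea can be sketched as a contrastive loss in which two recordings from the same patient form a positive pair and recordings from different patients form negatives. The sketch below uses an NT-Xent-style formulation over patient-matched embedding pairs; the function name and this minimal numpy implementation are illustrative assumptions, not the paper's actual training code.

```python
import numpy as np

def patient_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """Illustrative NT-Xent-style loss for patient contrastive learning.

    emb_a, emb_b: (n_patients, dim) arrays of encoder outputs, where row i
    of each array is the embedding of a different ECG from patient i.
    The loss pushes same-patient pairs (the diagonal) together and
    different-patient pairs apart.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature  # (n, n) similarity matrix
    # Row-wise cross-entropy: the same-patient entry should dominate its row.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

With perfectly aligned embeddings (each patient's two ECGs map to the same point, distinct across patients) the loss is near zero, while misaligned pairings yield a large loss, which is the gradient signal that teaches the encoder patient-level structure without any diagnostic labels.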