BroadE: Introduction to Machine Learning on Biomedical Data

The course will begin with a brief overview of the mathematical foundations of ML, specifically linear regression, logistic regression, and multilayer perceptrons. Applying these models to MNIST, the well-studied dataset of handwritten digits, we will gain first-hand experience with model selection, training, and validation. We will then introduce several abstractions from the ML4CVD codebase (TensorMaps, Model Factories, and Tensorization) that accelerate and simplify preparing data for ML, building models, and evaluating them. Next, we will examine several real-world biomedical datasets of varying size, structure, and quality. From the Allen Brain Atlas we will load high-dimensional brain MRI data linked with gene expression microarrays; from Qure.ai we will use the CQ500 dataset of 500 CT scans containing 193K slices linked with medical reports from 3 senior radiologists; and from the Erowid website we will use raw natural-language testimonials of drug experiences linked with basic demographics. These data present many new challenges, such as noisy labels, small sample sizes, confounding by indication, and batch effects. These issues are less prevalent in typical ML datasets, like MNIST or ImageNet, but are very common in biomedical data. After exploring the datasets visually and statistically, we will set up several ML problems using them as raw data. Lastly, we will explore model interpretation by visualizing saliency maps and class activation maps and by constructing adversarial examples. The course will conclude with a general discussion on framing biological questions as machine learning problems, with an opportunity for participants to brainstorm how ML might apply to their own datasets.
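As a rough illustration of the training and validation loop mentioned above, the following is a minimal sketch of binary logistic regression fit by gradient descent. It is not taken from the course materials or from the ML4CVD codebase: the synthetic two-cluster data stands in for a real dataset such as MNIST, and every name in it (`make_blobs`, `train`, `accuracy`, and the learning rate and epoch settings) is a hypothetical choice made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_blobs(n, rng):
    # Synthetic stand-in for real data (an assumption of this sketch):
    # two Gaussian clusters in 2D, one per class.
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        data.append((x, label))
    return data


def predict_proba(weights, bias, x):
    # Probability that x belongs to class 1.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(data, lr=0.1, epochs=100):
    # Full-batch gradient descent on the mean cross-entropy loss.
    weights, bias = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            err = predict_proba(weights, bias, x) - y
            for j in range(len(weights)):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, data):
    # Fraction of examples whose thresholded prediction matches the label.
    correct = sum(
        (predict_proba(weights, bias, x) >= 0.5) == (y == 1) for x, y in data
    )
    return correct / len(data)


rng = random.Random(0)
samples = make_blobs(400, rng)
# Hold out the last 100 examples for validation; fit on the first 300.
train_set, val_set = samples[:300], samples[300:]
weights, bias = train(train_set)
train_acc = accuracy(weights, bias, train_set)
val_acc = accuracy(weights, bias, val_set)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

Comparing accuracy on the held-out examples with accuracy on the training examples is the simplest form of validation. On small or noisy biomedical datasets, a large gap between the two is often an early sign of overfitting.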

Related links

video | Docker

Data Link 1
Data Link 2
Data Link 3