Sep 14, 2015
Nikolai Slavov, Northeastern Bioengineering
Quantifying protein isoforms
Many protein isoforms -- arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes -- have distinct biological functions. However, the accuracy of quantifying protein isoforms and their stoichiometries by existing mass-spectrometry (MS) methods remains limited because of noise from variation in protein digestion and peptide ionization. We eliminate the influence of this analytical noise by deriving a first-principles model (HIquant) for quantifying these stoichiometries solely from the ratios of corresponding ions. This approach allows unprecedented accuracy (error < 10%) in quantifying ratios between different proteins and their isoforms. I will discuss a mathematical proof of the conditions under which HIquant has a unique solution, and algorithms for its optimal solution.
This paper by Nikolai is a good reference for the mathematical issues, but not the biological ones:
Convex Total Least Squares
In a diverse set of problems ranging from dimensionality reduction to earthquake imaging, it is often the case that we seek to identify sparse or quantized representations of signals. The rationale for this may be: 1) physical (I believe that there are only a few interesting things), 2) philosophical (I only want to think about a few interesting things), or 3) computational (I only have enough computer to work with a few interesting things). The past two decades have seen radical progress in efficiently solving sparse recovery problems, and these approaches are now being used for non-trivial estimation problems. This talk focuses on sparse and quantized signal recovery from historical, geometric, and philosophical perspectives, with examples from earthquake physics.
Here are two of Brendan's papers with examples:
Total variation regularization of geodetically and geologically constrained block models for the Western United States
Geodetic imaging of coseismic slip and postseismic afterslip: Sparsity promoting methods applied to the great Tohoku earthquake
How many linear measurements (equations) do you need to recover a high-dimensional signal (unknowns)? If you know a basis in which the signal is sparse, and your measurements are not too aligned with this basis, then far fewer than you might expect. Moreover, you can recast your underdetermined problem as a convex program and solve it efficiently. I will talk about when and why this works, mentioning some now-classic applications and a few exciting possibilities in biology.
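To give the flavor of the recast-as-convex-program idea, here is a toy sketch in pure Python: an underdetermined system with 2 equations and 3 unknowns, where the true signal is 1-sparse. Soft-thresholded gradient descent (ISTA) on the ℓ1-regularized least-squares objective recovers it. The matrix, signal, and penalty are invented for illustration, not from the talk.

```python
# Sparse recovery sketch: 2 measurements, 3 unknowns, 1-sparse signal.
# Solve min_x 0.5*||Ax - b||^2 + lam*||x||_1 by ISTA
# (iterative soft-thresholding). All numbers are toy choices.

A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]          # 2x3 measurement matrix
x_true = [3.0, 0.0, 0.0]       # sparse signal we hope to recover

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def matvec_T(M, v):            # multiply by M transpose
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

b = matvec(A, x_true)          # the 2 linear measurements

def soft(z, t):                # soft-thresholding: prox of the l1 norm
    return max(z - t, 0.0) if z > 0 else min(z + t, 0.0)

lam, step = 0.01, 0.5          # step < 1 / ||A^T A||
x = [0.0, 0.0, 0.0]
for _ in range(2000):
    ax = matvec(A, x)
    r = [axi - bi for axi, bi in zip(ax, b)]   # residual Ax - b
    g = matvec_T(A, r)                         # gradient A^T (Ax - b)
    x = [soft(xj - step * gj, lam * step) for xj, gj in zip(x, g)]

print([round(v, 3) for v in x])   # close to [3, 0, 0]
```

Despite having fewer equations than unknowns, the ℓ1 penalty drives the two spurious coordinates exactly to zero and shrinks the active one by only about lam.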
Our aim is to give background and motivation for Scott's talk next week. Consider SNP association testing against a binary phenotype (disease vs. no disease). While linear regression enjoys very efficient inference, the simplest version is lacking due to:
- erroneous hard calls of variants (go with probabilities)
- multiple testing (go Bonferroni, FDR)
- confounding by ancestry, batch effects (go add PCs)
- cryptic relatedness (go full mixed model)
- binary phenotype (go logistic)
- overfitting (go Bayesian)
- nonlinear dependence of phenotype on covariates (go Gaussian process?)
- admixture (go topic model?)
- non-normal distribution of effect sizes (go GMM prior?)
- sparsity (go lasso?)
- epistasis (go neural net?)
- ascertainment bias (go do some research)
- high-dimensional phenotypes, both continuous and categorical (go do some modeling)
We will describe models addressing some of these points including Bayesian probit, logit, and mixed logit models, and time-permitting, some fancier models mixing continuous and discrete structure. Our emphasis will be on how exponential-family conjugacy makes inference easy via Gibbs sampling in certain cases, whereas its absence leads one toward despair (at least for six more days).
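As a concrete sketch of how augmentation buys conjugacy in the Bayesian probit case, here is a minimal 1-D Albert-Chib Gibbs sampler on simulated data. The sample size, prior variance, and chain length are invented for illustration.

```python
# Gibbs sampling for 1-D Bayesian probit regression via the
# Albert-Chib augmentation: z_i ~ N(beta*x_i, 1), truncated to be
# positive when y_i = 1 and negative when y_i = 0; given z,
# beta is conjugate Gaussian. All settings are toy choices.
import math
import random
from statistics import NormalDist

random.seed(1)
std = NormalDist()                       # standard normal cdf / inv_cdf
beta_true, n = 1.5, 200
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [1 if random.gauss(beta_true * x, 1) > 0 else 0 for x in xs]

def trunc_normal(mu, positive):
    """Sample N(mu, 1) truncated to (0, inf) or (-inf, 0) by inverse CDF."""
    split = std.cdf(-mu)                 # P(N(mu, 1) < 0)
    lo, hi = (split, 1.0) if positive else (0.0, split)
    u = min(max(random.uniform(lo, hi), 1e-12), 1 - 1e-12)
    return mu + std.inv_cdf(u)

sigma0_sq = 10.0                         # N(0, sigma0_sq) prior on beta
beta, draws = 0.0, []
for it in range(600):
    z = [trunc_normal(beta * x, y == 1) for x, y in zip(xs, ys)]
    v = 1.0 / (sum(x * x for x in xs) + 1.0 / sigma0_sq)   # posterior var
    m = v * sum(x * zi for x, zi in zip(xs, z))            # posterior mean
    beta = random.gauss(m, math.sqrt(v))
    if it >= 100:                        # discard burn-in
        draws.append(beta)

est = sum(draws) / len(draws)
print(round(est, 2))                     # posterior mean, near beta_true = 1.5
```

The conditional for beta is an ordinary Gaussian update precisely because the latent z's turn the probit likelihood into a linear-Gaussian one; the logistic link lacks this trick until next week's Pólya-gamma talk.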
We often have discrete count data with continuous latent structure or continuous regressors. It can be hard to match these two up in a Bayesian framework because of lack of conjugacy. Fortunately, there's a cool trick (Pólya-gamma augmentation) that allows us to render the discrete observations conjugate with a Gaussian prior, facilitating:
- Bayesian logistic regression, more efficiently
- structured sparse Gaussian models
- hierarchical Gaussian models (eg GMMs) with binary observations
- time series or Gaussian processes to capture dependencies between observations
We can extend this to other observation models too, like binomial, negative binomial, and multinomial observations. So if you know about LDA, now it's easy to combine LDA with Gaussian structure like correlated or dynamic topics.
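As a concrete sketch of the augmentation (not Scott's implementation), here is 1-D Bayesian logistic regression where the Pólya-gamma draw is approximated by truncating its infinite-sum-of-gammas representation. The data, prior, and truncation level are toy choices.

```python
# Gibbs sampler for 1-D Bayesian logistic regression using the
# Polya-gamma augmentation. PG(1, c) is sampled approximately via
# its sum-of-gammas representation (Polson, Scott & Windle 2013):
#   omega = (1 / 2*pi^2) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4*pi^2)),
# with g_k ~ Gamma(1, 1), truncated to a finite number of terms.
import math
import random

random.seed(2)
beta_true, n = 1.5, 100
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [1 if random.random() < 1 / (1 + math.exp(-beta_true * x)) else 0
      for x in xs]

def sample_pg1(c, terms=60):
    """Approximate draw from PG(1, c) by truncating the gamma sum."""
    d = c * c / (4 * math.pi * math.pi)
    s = sum(random.gammavariate(1.0, 1.0) / ((k - 0.5) ** 2 + d)
            for k in range(1, terms + 1))
    return s / (2 * math.pi * math.pi)

sigma0_sq = 10.0                       # N(0, sigma0_sq) prior on beta
kappa = [y - 0.5 for y in ys]          # y_i - 1/2
beta, draws = 0.0, []
for it in range(400):
    om = [sample_pg1(beta * x) for x in xs]            # omega_i | beta
    v = 1.0 / (sum(o * x * x for o, x in zip(om, xs)) + 1.0 / sigma0_sq)
    m = v * sum(k * x for k, x in zip(kappa, xs))      # beta | omega Gaussian
    beta = random.gauss(m, math.sqrt(v))
    if it >= 100:                      # discard burn-in
        draws.append(beta)

est = sum(draws) / len(draws)
print(round(est, 2))                   # posterior mean, near beta_true = 1.5
```

Conditioned on the omegas, the logistic likelihood becomes Gaussian in beta, which is exactly what makes the Gibbs step above a closed-form normal draw.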
Here is the original Polya-gamma augmentation paper from 2013, as well as Scott's hot-off-the-press work with Ryan Adams and Matt Johnson:
Bayesian inference for logistic models using Pólya-Gamma latent variables
Dependent Multinomial Models Made Easy: Stick Breaking with the Pólya-Gamma Augmentation
Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematics problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments. Second, I will analyze the hypothesis that the differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.
Here are some of Caroline's papers on packing problems and learning graphical models:
Packing Ellipsoids with Overlap
Sphere Packing with Limited Overlap
Learning directed acyclic graphs based on sparsest permutations
Faithfulness and learning hypergraphs from discrete distributions
Nov 2, 2015
Dougal Maclaurin, HIPS
NN I. Reverse-mode differentiation and autograd
Much of machine learning boils down to constructing a loss function and optimizing it, often using gradients. Reverse-mode differentiation (sometimes called "backpropagation") is a general and computationally efficient way to compute these gradients. I'll explain reverse-mode differentiation and show how we've implemented it for Python/Numpy in our automatic differentiation package autograd. I'll finish with some demos showing how easy it is to implement several machine learning models once you have automatic differentiation in your toolbox.
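To give the flavor (this is a toy, not autograd's actual implementation), a reverse-mode tape can fit in a few lines: record each operation's inputs and local derivatives as the function runs forward, then sweep backward applying the chain rule.

```python
# A minimal reverse-mode autodiff sketch in the spirit of autograd.
# The real package overloads numpy; this toy handles only + and *.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent_Var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate d(output)/d(self) by the chain rule, walking from
        # the output back to the inputs. (Real systems topologically
        # sort the graph instead of recursing naively like this.)
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = Var(3.0), Var(4.0)
f = x * y + x
f.backward()
print(f.value, x.grad, y.grad)   # 15.0 5.0 3.0
```

Note that x appears twice in the expression, and its two upstream contributions are summed automatically; that accumulation is the heart of backpropagation.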
Predicting properties of molecules requires functions that take graphs as inputs. Molecular graphs are usually preprocessed using hash-based functions to produce fixed-size fingerprint vectors, which are used as features for making predictions. We introduce a convolutional neural network that operates directly on graphs, allowing end-to-end learning of the feature pipeline. This architecture generalizes standard molecular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
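The message-passing idea behind such graph convolutions can be sketched as follows. The graph, features, and weights here are invented for illustration; the real architecture learns its weights end-to-end from molecular data.

```python
# Sketch of "convolution on graphs": each atom carries a feature
# vector, and each layer mixes in the features of bonded neighbors,
# so after L layers each atom summarizes its radius-L neighborhood.
# Weights are fixed toy numbers rather than learned parameters.
import math

# A 3-atom chain (e.g., a propane-like skeleton), 2-dim features.
adj = {0: [1], 1: [0, 2], 2: [1]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}

def layer(h, adj, w_self=0.6, w_nbr=0.4):
    """One message-passing layer: tanh(w_self*h_v + w_nbr*sum over neighbors)."""
    out = {}
    for v, hv in h.items():
        agg = [sum(h[u][d] for u in adj[v]) for d in range(len(hv))]
        out[v] = [math.tanh(w_self * hv[d] + w_nbr * agg[d])
                  for d in range(len(hv))]
    return out

for _ in range(2):            # two layers: radius-2 neighborhoods
    h = layer(h, adj)

# Pool over atoms into a fixed-size, order-invariant fingerprint.
fingerprint = [sum(h[v][d] for v in h) for d in range(2)]
print([round(f, 3) for f in fingerprint])
```

Because the pooling is a sum over atoms, the fingerprint has fixed size regardless of molecule size and is invariant to atom ordering, mirroring the role of hash-based circular fingerprints but with differentiable, learnable pieces.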
Here is David and Dougal's paper, the github repositories, as well as a great review of conv nets:
Convolutional Networks on Graphs for Learning Molecular Fingerprints
Molecular fingerprint repository
Review of conv nets
Nov 12, 2015
Professor Ryan Adams
Machine Learning and the Life Sciences: Beyond Data Analysis
1pm: Colloquium in the auditorium
2pm: light refreshments in the lobby
Machine learning is about understanding and building computational processes for adapting to data and experience, something that for most of natural history has only existed in living organisms. Lying at the interface between computer science and statistics, machine learning has in recent years come into the spotlight for providing rich new tools for data analysis. While machine learning is interacting with many different scientific areas, collaborations with the life sciences have been particularly exciting as biology invests increasingly in automation and high-throughput data collection methods.
It is an amazing time for computer scientists and biologists to work together, but we can go far beyond data analysis. I will discuss two such collaborative areas that push this boundary: the automated design of biologically-relevant systems, and the exploration of adaptive algorithms in biological substrates. For the former, I will describe ongoing work to automate the design of systems such as organic molecules, DNA sequences, and biomimetic robots. For the latter, I will give an overview of recent work showing how important classes of machine learning algorithms can be implemented with biomolecules, without resorting to digital models for chemical computation.
BIO: Ryan Adams is Head of Research at Twitter Cortex and an Assistant Professor of Computer Science at Harvard. He received his Ph.D. in Physics at Cambridge as a Gates Scholar. He was a CIFAR Junior Research Fellow at the University of Toronto before joining the faculty at Harvard. He has won paper awards at ICML, AISTATS, and UAI, and his Ph.D. thesis received Honorable Mention for the Savage Award for Theory and Methods from the International Society for Bayesian Analysis. He also received the DARPA Young Faculty Award and the Sloan Fellowship. Dr. Adams was the CEO of Whetlab, a machine learning startup that was recently acquired by Twitter, and co-hosts the Talking Machines podcast.
Here are several resources on Bayesian optimization and chemical reaction networks (see also, Jon's talk on May 4, 2015):
Practical Bayesian optimization of machine learning algorithms
Message passing inference with chemical reaction networks
The complex language of eukaryotic gene expression remains incompletely understood. Thus, most of the many noncoding variants statistically associated with human disease have unknown mechanism. Here, we address this challenge using an approach based on a recent machine learning advance—deep convolutional neural networks (CNNs). We introduce an open source package Basset (https://github.com/davek44/Basset) to apply deep CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq. Basset predictions for the change in accessibility between two variant alleles were far greater for GWAS SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
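The core operation in such models is an ordinary 1-D convolution over one-hot-encoded sequence. Here is a minimal sketch with an invented motif filter; Basset itself learns hundreds of such filters from DNase-seq data rather than using hand-written ones.

```python
# 1-D convolution over DNA in miniature: one-hot encode a sequence
# and slide a position-weight-matrix-like filter along it. The
# filter values below are invented for illustration.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    return [[1.0 if BASES[b] == i else 0.0 for i in range(4)] for b in seq]

# A toy filter of width 3 that scores highest on the motif "TAT".
filt = [[0, 0, 0, 1],   # position 1 favors T
        [1, 0, 0, 0],   # position 2 favors A
        [0, 0, 0, 1]]   # position 3 favors T

def conv1d(seq, filt):
    x = one_hot(seq)
    width = len(filt)
    return [sum(filt[j][c] * x[i + j][c]
                for j in range(width) for c in range(4))
            for i in range(len(x) - width + 1)]

scores = conv1d("GCTATAG", filt)
print(scores)   # [1.0, 0.0, 3.0, 0.0, 2.0] -- peak where "TAT" matches
```

A deep CNN stacks many such filters with nonlinearities and pooling, so later layers respond to combinations and spacings of motifs rather than single matches.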
Here is David's paper with Jasper Snoek (HIPS, Twitter Cortex) and PI John Rinn, as well as a great graphical explanation of (2d) conv nets:
Learning the regulatory code of the accessible genome with deep convolutional neural nets
Review of conv nets
Complex animal behaviors are likely built from simpler modules, but their systematic identification in mammals remains a significant challenge. We use depth imaging to show that three-dimensional (3D) mouse pose dynamics are structured at the sub-second timescale. Computational modeling of these fast dynamics effectively describes mouse behavior as a series of reused and stereotyped modules with defined transition probabilities, which collectively encapsulate the underlying structure of mouse behavior within a given experiment. By deploying this 3D imaging and machine learning method in a variety of experimental contexts, we show that it unmasks potential strategies employed by the brain to generate specific adaptations to changes in the environment, and captures both predicted and previously-hidden phenotypes induced by genetic or neural manipulations. Further, we demonstrate its utility in automatically unblinding the behavioral effects of pharmacological manipulation. This work demonstrates that mouse body language is built from identifiable components and is organized in a predictable fashion; deciphering this language establishes an objective framework for characterizing the influence of environmental cues, genes and neural activity on behavior.
Joint with Matt Johnson, who will tell us more about the underlying models next week.
Probabilistic generative modeling can help us discover structured representations from unsupervised time series data. I'll survey some basic ideas from Bayesian modeling and inference for time series and give examples of how they can be composed and extended. In particular, I'll focus on building up Bayesian switching linear dynamical systems (SLDS) and associated sampling and structured mean field inference algorithms, motivated by applications to behavior modeling from last week.
If time permits, I'll also talk about our current work on integrating these structured Bayesian generative models with the right amount of "neural net goo" to combine their respective strengths. I might also show some magic tricks with autograd.
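The linear-Gaussian building block that an SLDS switches between can be sketched in scalar form: simulate a linear dynamical system, then run the Kalman filter on it. All parameters (a, q, r) are toy values chosen for illustration, not from Matt's models.

```python
# A scalar linear dynamical system and its Kalman filter, the exact
# inference routine at the heart of (switching) LDS models.
import math
import random

random.seed(3)
a, q, r = 0.9, 0.5, 1.0      # dynamics coeff, process var, observation var

# Simulate x_t = a*x_{t-1} + process noise, y_t = x_t + obs noise.
x, xs, ys = 0.0, [], []
for _ in range(100):
    x = a * x + random.gauss(0, math.sqrt(q))
    xs.append(x)
    ys.append(x + random.gauss(0, math.sqrt(r)))

# Kalman filter: posterior mean m and variance p of x_t given y_1..t.
m, p, ms = 0.0, 1.0, []
for y in ys:
    m_pred, p_pred = a * m, a * a * p + q      # predict step
    k = p_pred / (p_pred + r)                  # Kalman gain
    m = m_pred + k * (y - m_pred)              # update with observation
    p = (1 - k) * p_pred
    ms.append(m)

mse_filter = sum((mi - xi) ** 2 for mi, xi in zip(ms, xs)) / len(xs)
mse_raw = sum((yi - xi) ** 2 for yi, xi in zip(ys, xs)) / len(xs)
print(round(mse_filter, 3), round(mse_raw, 3))   # filtering beats raw obs
```

In an SLDS, a discrete Markov chain picks which (a, q, r) regime is active at each time step, and the Gibbs and structured mean field algorithms in the talk alternate between this Kalman-style message passing and updates to the discrete switches.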
Linear dynamical systems: https://en.wikipedia.org/wiki/Linear_dynamical_system
Talking Machines on LDS and SLDS: 1m45s - 10m
Matt's thesis on Bayesian time-series modeling: http://www.mit.edu/~mattjj/thesis.pdf
And here's an interesting autograd example which uses that gradient-through-forward-pass method: https://github.com/HIPS/autograd/blob/master/examples/hmm_em.py
Sep 22, 2014
Contingency tables I: t-test and z-test
Sep 29, 2014
Contingency tables II: correlation, Pearson chi-squared test, and Fisher exact test
Oct 6, 2014
Contingency tables III: examples in genetics
Nov 3, 2014
Jon Bloom and Bertrand Haas
Puzzle day: drunk Monty Hall, the two envelopes, the bloody crime scene, and Simpson's paradox
Nov 24, 2014
Principal component analysis (PCA) and the Marchenko-Pastur law
Dec 8, 2014
Non-negative matrix factorization (NMF)
Dec 15, 2014
Jon Bloom and Cotton Seed
Non-linear dimensionality reduction: tSNE and diffusion maps
Jan 26, 2015
Independent component analysis (ICA) and projection pursuit
Feb 24, 2015
Alex Bloemendal, Jon Bloom, Bertrand Haas, Cotton Seed
Comparison of dimensionality reduction methods: PCA, ICA, NMF, tSNE, and diffusion maps
Mar 2, 2015
Introduction to Bayesian graphical models: the Gaussian mixture model (Bishop, Ch8)
Mar 9, 2015
Expectation maximization and inference on Gaussian mixture models (Bishop, Ch9)
Mar 16, 2015
Variational Bayes and inference on Gaussian mixture models (Bishop, Ch10)
Mar 23, 2015
Laura Gauthier and Bertrand Haas
Variant quality score recalibration
Mar 30, 2015
Markov Chain Monte Carlo and Gibbs sampling on Gaussian mixture models (Bishop, Ch11)
April 7, 2015
Genetic fingerprints and contamination estimation
April 14, 2015
Linear mixed models for genetic association analysis
April 28, 2015
Connectivity map and challenges in data normalization
May 4, 2015
Introduction to Gaussian processes and Bayesian optimization (Bishop, Ch6)
June 1, 2015
Graph-based genetic sequence representation
June 8, 2015
Choosing priors in Bayesian inference
June 15, 2015
Discussion of Pachter's p-value prize
June 22, 2015
Conjugate priors and Hardy-Weinberg equilibrium (Bishop, Ch2)
June 29, 2015
Introduction to Dirichlet processes
July 6, 2015
The Chinese restaurant process and Indian buffet process
July 13, 2015
Introduction to evolutionary algorithms and NEAT
July 20, 2015
LD score regression for distinguishing confounding from polygenicity
July 27, 2015
Challenges in normalization of RNAseq data