The Models, Inference & Algorithms Initiative (MIA) supports learning and collaboration across the interface of biology and mathematics, statistics, machine learning, and computer science. Our weekly meeting features a primer, breakfast, seminar, and discussion; these are open and pedagogical, celebrating lucid exposition of computational ideas over the rapid-fire communication of results. Other MIA functions include hosting workshops, developing educational resources, advising leadership, and supporting the computational community to do its best work.
Super deep generative networks for cat robots: or how I learned to start worrying more about the public conversation
Abstract: By exploring the imagery and tropes used to communicate about machine learning and artificial intelligence, both in the public conversation and in the expert community, we’ll place a clearer focus on the work that needs to be done in order to bring the reality of research into the world.
The story algorithm
Abstract: It is essential for any researcher to be able to communicate not only to other experts but to a lay audience about the reality of their research. In this workshop we will explore the fundamentals of scientific storytelling and work to build skill sets participants can use in the future.
Analyzing scientific data with topological data analysis
Abstract: I'll give a gentle introduction to the techniques and core ideas of topological data analysis (TDA), with a focus on the application of these methods to scientific data. I will emphasize the integration of TDA methods with statistics. My goal is to communicate what we can (and cannot) expect TDA to tell us and when TDA is likely to be a meaningful and robust analytic tool.
Using random matrix theory to extract signals from single-cell expression data
Abstract: I'll describe a method for low-rank approximation of a data matrix arising from single-cell RNA sequencing data. Our basic observation is that such data is consistent with a sparse version of the "spike model" studied in random matrix theory, in which a noise matrix has a low-rank signal added in. As a consequence, the contributions from noise to the output of principal components analysis on this data may be characterized in terms of universal distributions and removed. This is joint work with Luis Aparicio, Mykola Bordyuh, and Raul Rabadan.
Inferring geometric embeddings for single cell data
Abstract: Massively multiplexed sequencing of RNA in individual cells is transforming basic and clinical life sciences. However, in standard experiments, tissues are first dissociated and information about the spatial relationships between cells is lost, although this knowledge is crucial for understanding tissue-level function. Recent attempts to overcome this fundamental challenge rely on employing additional in situ gene expression imaging data which can guide spatial mapping of sequenced cells. Here we present a conceptually different approach that allows us to reconstruct spatial positions of cells in a variety of tissues without using reference imaging data. We first show for several complex biological systems that distances between single cells in expression space monotonically increase with their physical distances across tissues. We therefore seek to map cells to tissue space such that this principle is optimally preserved, while matching existing imaging data when available. We show that this optimization problem can be cast as a generalized optimal transport problem and solved efficiently. We apply our approach to reconstruct the mammalian liver and intestinal epithelium as well as fly and zebrafish embryos. Our results demonstrate a simple spatial expression organization principle and that this principle can be used to infer, for individual cells, meaningful spatial position probabilities from the sequencing data alone.
Optics-free spatio-genetic imaging with DNA microscopy
Abstract: Complex cell populations, from the brain to the adaptive immune system, rely on diverse gene variants, somatic mutations, and expression patterns for some of their most essential functions. This genetic heterogeneity not only endows intrinsic properties to individual cells, but it also often operates at the level of inter-cellular interactions. Technologies that jointly resolve both gene sequences and the spatial relationships of the cells that express them therefore have a key role to play in deepening our understanding of tissue biology. In this talk, I will introduce DNA microscopy, a new imaging modality that operates by encoding pairwise distances between biomolecules in a sample directly into a DNA sequence library using a stand-alone chemical reaction. I will then present experiments demonstrating that, with these distances encoded, the positions of biomolecules and cells can be computationally inferred by DNA sequence analysis. DNA microscopy requires neither micromanipulation nor specialized equipment and leverages the power of commercial sequencers. Because its imaging power derives entirely from diffusive molecular dynamics, DNA microscopy constitutes a chemically encoded microscopy system.
From Morse theory to geometric ensembling via the topology of PCA
Abstract: We'll start with a visual introduction to Morse theory, which relates the topology (shape) of a manifold (space) to the behavior of smooth, real-valued functions on that manifold. We'll then apply this relationship in both directions. First, we’ll consider the function on the space of k-planes in R^m given by squared distance to a fixed point cloud, leading to a visceral understanding of the gradient dynamics of PCA as learned by a linear autoencoder. Second, we’ll consider the loss function of a deep neural network. We’ll explain how the Morse homology of Euclidean space forces geometric relationships between critical points, establishing a theoretical foundation for fast geometric ensembling that in turn suggests new algorithms. Paper and visualization here.
Regularized linear autoencoders, probabilistic PCA, and backpropagation in the brain
Abstract: Autoencoders are a deep learning model for representation learning. When trained to minimize the Euclidean distance between the data and its reconstruction, linear autoencoders (LAEs) learn the subspace spanned by the top principal directions but cannot learn the principal directions themselves. Here we prove that L2-regularized LAEs learn the principal directions as the left singular vectors of the decoder, providing an extremely simple and scalable algorithm for rank-k SVD. More generally, we consider LAEs with (i) no regularization, (ii) regularization of the composition of the encoder and decoder, and (iii) regularization of the encoder and decoder separately. We relate the minimum of (iii) to the MAP estimate of probabilistic PCA and show that for all critical points the encoder and decoder are transposes. Building on the topological intuition of the primer, we smoothly parameterize the critical manifolds for all three losses via a novel unified framework and illustrate these results empirically. Overall, this work clarifies the relationship between autoencoders and Bayesian models and between regularization and orthogonality. Most excitingly, it suggests a simple, biologically-plausible, and testable resolution of the "weight symmetry problem," namely a local mechanism by which maximizing information flow and minimizing energy expenditure gives rise to backpropagation as the optimization algorithm underlying efficient neural coding. Paper and visualization here.
Abstract: A general, sparse and speculative introduction to statistical models of protein evolution and why some approximations work and others don't. Touching upon conservation, co-evolution, phylogenetics and how we can use these models to "cheat" or avoid the protein folding problem for protein structure determination and design.
End-to-end differentiable learning of protein structure
Abstract:Predicting protein structure from sequence is a central challenge of biochemistry. Co-evolution methods show promise, but an explicit sequence-to-structure map remains elusive. Advances in deep learning that replace complex, human-designed pipelines with differentiable models optimized end-to-end suggest the potential benefits of similarly reformulating structure prediction. Here we report the first end-to-end differentiable model of protein structure. The model couples local and global protein structure via geometric units that optimize global geometry without violating local covalent chemistry. We test our model using two challenging tasks: predicting novel folds without co-evolutionary data and predicting known folds without structural templates. In the first task the model achieves state-of-the-art accuracy and in the second it comes within 1-2Å; competing methods using co-evolution and experimental templates have been refined over many years and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.
Abstract: High-dimensional data behave in ways that seem to contradict intuitions from low-dimensional geometry and classical statistics, particularly in detecting and recovering low rank signal. Random matrix theory is a branch of mathematics that characterizes such phenomena; I will sketch a few relevant results.
Controlling for stratification in (meta)-GWAS with PCA: theory, applications, and implications
Abstract: Principal component analysis (PCA) is the standard method for estimating population structure and sample ancestry in genetic datasets. Population structure can induce confounding in genome-wide association studies (GWAS), which is typically addressed by including principal components (PCs) as covariates. However, results from random matrix theory (RMT) predict that PCA fails to detect population differentiation below a particular threshold and that even above the threshold, sample PCs may be only partially correlated with true axes of differentiation. These phenomena depend for each PC on the corresponding eigenvalue; we extend previous work to characterize and interpret the eigenvalues for general population structures. Moreover, we propose an estimator for the effective number of unlinked variants that outperforms previous moments-based estimators, which we then combine with RMT results to estimate the inaccuracy of each PC and predict how this inaccuracy leads to residual confounding in GWAS on stratified phenotypes. We validate our method via downsampling experiments on real data including the UK Biobank and suggest this behavior may be driving the uncorrected stratification recently observed in some large meta-analyses of smaller GWAS.
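The detection threshold and the partial correlation of sample PCs mentioned above can be illustrated with the standard asymptotic formulas for a single-spike covariance model (the Baik-Ben Arous-Péché transition). This is only a sketch under stated assumptions, not the speakers' estimator: unit-variance noise, one population spike of size `spike`, and aspect ratio `gamma` (variables divided by samples) are all illustrative choices.

```python
import math


def mp_upper_edge(gamma):
    """Upper edge of the Marchenko-Pastur bulk for unit-variance noise."""
    return (1.0 + math.sqrt(gamma)) ** 2


def detection_threshold(gamma):
    """Population spike size below which PCA cannot separate signal from noise."""
    return 1.0 + math.sqrt(gamma)


def top_sample_eigenvalue(spike, gamma):
    """Asymptotic limit of the largest sample covariance eigenvalue."""
    if spike <= detection_threshold(gamma):
        # Below the threshold the top eigenvalue sticks to the bulk edge.
        return mp_upper_edge(gamma)
    return spike * (1.0 + gamma / (spike - 1.0))


def squared_alignment(spike, gamma):
    """Asymptotic squared cosine between the sample PC and the true direction."""
    if spike <= detection_threshold(gamma):
        return 0.0
    return (1.0 - gamma / (spike - 1.0) ** 2) / (1.0 + gamma / (spike - 1.0))
```

With `gamma = 0.25`, a spike of 3.0 is detectable but its sample PC is still only partially aligned with the true axis, while a spike of 1.2 is lost in the noise bulk.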
Learning structure in mouse behavior using Motion Sequencing (MoSeq)
Abstract: Understanding how the brain governs behavior is a fundamental goal of modern neuroscience; however, our ability to measure behavior lags far behind our ability to manipulate genes involved in typical brain function or to measure neural activity. Here, we will learn about the general workflow of a novel machine vision and learning technique, called Motion Sequencing (MoSeq), designed to objectively quantify 3D mouse behavior on a sub-second timescale. Moreover, we will discuss the technique’s current and future uses in addressing questions in behavioral neuroscience.
Using machine learning to understand how the brain implements moment to moment action selection
Abstract: Many naturalistic behaviors are built from stereotyped, modular components that are flexibly arranged to form sequences. Although striatal circuits in the brain have been implicated in action selection and implementation, the neural mechanisms that compose behavior in unrestrained animals are not well understood. I will discuss recently published work on simultaneous recording of neural activity in the direct and indirect pathways of dorsolateral striatum (DLS) and monitoring of 3D pose dynamics as mice spontaneously express action sequences. These experiments demonstrate that DLS neurons systematically encode information about the identity and sequential ordering of stereotyped sub-second 3D behavioral motifs; this encoding is facilitated by fast-timescale decorrelations between the direct and indirect pathways. Furthermore, perturbing the DLS prevents appropriate sequence assembly during both exploratory and odor-evoked behaviors. By characterizing naturalistic behavior at neural timescales, these experiments identify a code for 3D pose dynamics built from complementary pathway dynamics, support a role for DLS in constructing meaningful behavioral sequences, and suggest models for how actions are sculpted over time. I will also discuss unpublished work on using closed-loop recognition of behavioral motifs to study the neural implementation of reinforcement learning.
March 27, 2019
Alejandro Reyes, Yifeng Qi
Aryee Lab, Zhang Lab
Modelling the 3D organization of chromosomes in the cancer cell nucleus
Abstract:3D maps of the human genome have revealed a hierarchical organization of DNA within the nucleus. This organization, including chromatin loops, topologically associated domains (TADs) and active and inactive compartments, has a strong association with transcription. In cancer, the genome is known to undergo profound epigenetic alterations, but the relationship between these changes and 3D organization has not been characterized. We will describe a maximum entropy approach to construct an optimal ensemble of chromosome structures that reproduce a Hi-C contact probability map. This method is highly efficient and can be applied to the whole genome for modeling intra- and inter-chromosomal interactions at a range of scales. It can be generalized to incorporate other experimental data such as imaging. Using these methods we show how to generate nuclear organization maps. We show an example where we have investigated tumor-specific aberrations to 3D genome structure showing widespread disruption of heterochromatin organization in colon cancer.
Understanding biological systems: In search of direct causal mechanisms
Abstract:The advent of DNA-microarrays spurred a vigorous effort to reverse engineer biological networks. Recently, these efforts have been reinvigorated by the availability of RNA-seq data from perturbed and unperturbed single cells. I will discuss the opportunities and limitations of using such data for inferring networks of direct causal interactions, with emphasis on the distinctions between models based on direct and indirect interactions. This discussion motivates the need to model proteins since most biological interactions involve proteins. Then I will introduce key ideas and technological capabilities of high-throughput single-cell proteomics methods that we have developed and will focus on the opportunities of using such data for inferring direct causal mechanisms in biological systems.
Slavov Lab, Northeastern Bioengineering
Primer: Quantifying proteins by mass-spec
Abstract: Mass spectrometry-based proteomics is a suite of high-throughput and sensitive approaches for identifying and quantifying proteins in biological samples. These methods allow for quantifying >10,000 proteins in bulk samples. However, these techniques have not yet been widely applied to single cells despite the fact that modern mass spectrometers can detect single ions. To explain why, this primer talk will explore core concepts of mass spectrometry-based proteomics with emphasis on developing intuition for the physical processes underpinning peptide sequencing and quantification. In particular, I will cover what is called "shotgun" or "discovery" proteomics using isobaric barcoding, a technology used by Single Cell Proteomics by Mass Spectrometry (SCoPE-MS). The primer talk will outline the obstacles that have limited the broad application of quantitative mass spectrometry to single-cell analysis and how SCoPE-MS overcomes these obstacles to enable profiling thousands of proteins across thousands of single cells.
Multi-cause causal inference: challenges and techniques
Abstract: In many scientific endeavors, such as GWAS, we hope to understand how manipulations of a set of inputs (causes) will affect an outcome of interest (effect). I will refer to this problem as "multi-cause causal inference". When we only have observational data, confounding from unobserved variables is a major barrier to understanding these causal relationships. In this talk, I will use simple analytical examples to illustrate the central challenges of this problem. In particular, I will show that approaches to multi-cause causal inference resembling factor analysis are insufficient to pinpoint causal effects, even with infinite data. I'll then discuss how we can make progress in certain cases by using auxiliary negative control variables, or by shifting focus to interval estimation.
Abstract: In this talk, we'll consider the problem of feature selection using black box predictive models. For example, high-throughput devices in science are routinely used to gather thousands of features for each sample in an experiment. The scientist must then sift through the many candidate features to find explanatory signals in the data, such as which genes are associated with sensitivity to a prospective therapy. Often, predictive models are used for this task: the model is fit, error on held out data is measured, and strong performing models are assumed to have discovered some fundamental properties of the system. A model-specific heuristic is then used to inspect the model parameters and rank important features, with top features reported as "discoveries." However, such heuristics provide no statistical guarantees and can produce unreliable results. Here, I'll present the holdout randomization test (HRT) as a principled approach to feature selection using black box predictive models. The HRT is model agnostic and produces a valid p-value for each feature, enabling control over the false discovery rate (or Type I error) for any predictive model. Time permitting, I'll also discuss how the techniques from the HRT can be adapted to the related, but subtly different, task of interpreting black box model predictions. The talk is based on two papers.
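As a rough illustration of how a holdout randomization test can turn held-out error into a per-feature p-value, here is a minimal sketch. It is not the speaker's implementation. The names `hrt_p_value`, `black_box` and `standard_normal_conditional` are hypothetical, the "model" is a fixed function standing in for any fitted predictor, and the features are assumed to be independent standard normals, so that resampling one feature from its conditional distribution reduces to drawing from N(0, 1).

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a predictor on held-out data."""
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def hrt_p_value(predict, rows, targets, feature, sample_conditional,
                n_null=200, seed=0):
    """P-value for one feature: how often does replacing it with a draw from
    its conditional distribution give held-out error as low as the real data?"""
    rng = random.Random(seed)
    observed = mse(predict, rows, targets)
    at_least_as_good = 0
    for _ in range(n_null):
        null_rows = []
        for r in rows:
            fake = list(r)
            fake[feature] = sample_conditional(r, feature, rng)
            null_rows.append(fake)
        if mse(predict, null_rows, targets) <= observed:
            at_least_as_good += 1
    return (1 + at_least_as_good) / (1 + n_null)


def standard_normal_conditional(row, feature, rng):
    # ASSUMPTION: features are independent N(0, 1), so the conditional
    # distribution of one feature given the others is just N(0, 1).
    return rng.gauss(0.0, 1.0)


def make_holdout(n=200, seed=1):
    """Toy held-out set: the target depends on feature 0 only."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
    targets = [2.0 * r[0] + rng.gauss(0.0, 0.5) for r in rows]
    return rows, targets


def black_box(row):
    # Hypothetical stand-in for a fitted model; it ignores feature 1.
    return 2.0 * row[0]


rows, targets = make_holdout()
p_relevant = hrt_p_value(black_box, rows, targets, 0, standard_normal_conditional)
p_irrelevant = hrt_p_value(black_box, rows, targets, 1, standard_normal_conditional)
```

The model is never refit inside the loop; only the held-out inputs are resampled, which is what lets the procedure treat the predictor as a black box. In real data the hard part is modelling the conditional distribution of each feature, which this toy sidesteps by construction.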
Abstract: One of the least expected findings from systems neuroscience is the "Default Mode Network". This macroscopic brain network has the highest metabolic consumption and perhaps the highest neuronal baseline activity. Functional processing in this network is associated with diverse human-defining psychological processes: complex social cognition, such as perspective-taking, language and moral judgment, as well as the imagination of events and places in the future and past. At the same time, the default-mode network has been linked to a range of psychiatric disorders, including schizophrenia, autism and depression. Despite its anthropological significance, the (patho-)physiological function of this network remains essentially unknown. The alternative quantitative approaches to investigating the biology of the DMN will include semisupervised factored logistic regression, extended autoencoder architectures including L1 penalization, hierarchical tree sparsity for regularized high-dimensional prediction, transfer learning in multi-output deep models, and canonical correlation analysis with bootstrapped sensitivity analysis.
Drexel Medical School
Primer: Ways of Seeing the Brain's Default Mode Network
Abstract:What happens in the brain when the mind wanders? The surprising discovery of the default mode network came hand-in-hand with a trend toward analyzing intrinsic brain activity. Several imaging modalities have been used to isolate and analyze this network, its normal metabolism, development, and functional anatomy. Activations of the default mode network are associated with self-referential thinking, theories of mind, moral reasoning, and mental time travel, while perturbations of it are associated with psychiatric conditions such as PTSD, depression, autism, and Alzheimer's disease. A new language is emerging which describes both normal and pathological brain functions in terms of distributed networks of connectivity rather than discrete and specialized regions.
Abstract:The era of megasample genomics, where datasets routinely contain millions of sampled genomes, is upon us. Present-day computational methods are fundamentally organised around the variant matrix, where each row describes the observations for every sample at a given genomic location. At megasample-scale, such matrices are massively unwieldy and cannot be processed without complex parallel algorithms. We show that the recently-introduced succinct tree sequence data structure has the potential to hugely reduce storage and processing costs; that it directly encodes important biological signals; and that it has led to efficiency gains of several orders of magnitude in simulation and whole-genome ancestry inference. We examine the underlying algorithmic properties of tree sequences that enable such breakthrough performance gains, and also discuss a preliminary algorithm for exactly solving the Li and Stephens model in logarithmic time.
Primer: Introduction to the tree sequence toolchain
Abstract:The succinct tree sequence data structure is a concise and efficient encoding of whole-genome ancestry and sequence data, with a rapidly maturing software ecosystem. The tskit (tree sequence toolkit) library provides a comprehensive framework for working with tree sequences using Python and C APIs. The ecosystem growing around this central technology now includes several genome simulators as well as our highly-scalable method for inferring ancestry from data, tsinfer. In this primer session, we will introduce tskit and the tree sequence data structure as well as demonstrate both the simulation and inference of genomic datasets in real-time using downloadable Jupyter Notebooks.
Phasing and imputing repeat variants across the genome
Abstract:A fundamental mystery of the genome-wide association study (GWAS) era is the gap between the heritability of phenotypes observed in family studies and the heritability successfully explained by association studies. One often cited source of this "missing heritability" is structural variants, which account for a majority of the base pairs varying among genomes but are usually omitted in GWAS due to the difficulty of genotyping them. In this talk, I will present new methods that enable the extension of GWAS analyses to a certain class of structural variants, variable number tandem repeats (VNTRs). The methods allow phasing of diploid repeat length estimates in whole-genome sequence data and imputation of repeat variants into much larger genotyped cohorts. I will discuss ongoing efforts to apply this approach genome-wide in UK Biobank (N~500K) and incorporate these variants in association studies.
Primer: Hidden Markov models in phasing and imputation
Abstract: Haplotype phasing and imputation have become essential components of genome-wide association analysis pipelines, as these methods allow imputation of genetic variation from smaller whole-genome sequenced reference panels into larger cohorts (genotyped much more sparsely at low cost). Over the past two decades, phasing and imputation methods have undergone several generations of development as sample sizes and variant counts in typical analyses have each increased by five orders of magnitude. I will give an overview of the algorithmic themes that have emerged from these approaches -- many based on the Li-Stephens hidden Markov model -- and discuss the computational considerations that are now informing future directions in this field.
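For readers new to the Li-Stephens model, the forward algorithm below is a minimal sketch of its core idea: a query haplotype is modelled as an imperfect mosaic copied from a reference panel. The simplifications are assumptions made for illustration: biallelic sites coded 0/1, a single constant switch probability per site, and a constant mismatch probability. Production phasing and imputation tools use genetic maps and many further optimizations.

```python
import itertools


def li_stephens_likelihood(query, panel, switch=0.05, mismatch=0.01):
    """Probability of `query` under a simplified Li-Stephens copying model.

    The hidden state is the index of the panel haplotype being copied. At
    each site the copy stays put with probability 1 - switch, or jumps to a
    uniformly chosen panel haplotype with probability switch.
    """
    k = len(panel)

    def emit(h, site):
        # Copying is imperfect: the copied allele is flipped with prob. mismatch.
        return 1.0 - mismatch if panel[h][site] == query[site] else mismatch

    forward = [emit(h, 0) / k for h in range(k)]
    for site in range(1, len(query)):
        total = sum(forward)
        # The uniform-jump structure makes each update O(k) instead of O(k^2).
        forward = [
            emit(h, site) * ((1.0 - switch) * forward[h] + switch * total / k)
            for h in range(k)
        ]
    return sum(forward)


def total_probability(panel, n_sites):
    """Sanity check: likelihoods over all binary queries should sum to one."""
    return sum(
        li_stephens_likelihood(list(q), panel)
        for q in itertools.product([0, 1], repeat=n_sites)
    )


panel = [[0, 0, 0, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
exact_copy = li_stephens_likelihood([0, 0, 0, 0], panel)
poor_match = li_stephens_likelihood([1, 0, 1, 0], panel)
```

Because a jump lands on every panel haplotype with equal probability, the sum over previous states is shared by all destinations. That is why the per-site cost is linear in panel size.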
Abstract:Modern biomedical science is defined by noisy high-dimensional data, whether from microscopes (electron, light-sheet, confocal), sequencing (RNA-seq, ATAC-seq, Hi-C), or sensors (physiology, EEG). We present a general framework for denoising high-dimensional measurements which can be applied to any of these domains, and which requires no prior on the signal, no estimate of the noise, and no clean training data. The only assumption is that the noise exhibits statistical independence across different dimensions of the measurement, while the true signal exhibits some correlation. For a broad class of functions ("J-invariant"), it is then possible to estimate the performance of a denoiser from noisy data alone. This allows us to calibrate J-invariant versions of any parameterised denoising algorithm, from the single hyperparameter of a median filter to the millions of weights of a deep neural network. We demonstrate this on natural image and microscopy data, where we exploit noise independence between pixels, and on single-cell gene expression data, where we exploit independence between detections of individual molecules. Finally, we prove a theoretical lower bound on the performance of an optimal denoiser. This framework generalizes recent work on training neural nets from noisy images and on cross-validation for matrix factorization. Preprint here.
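The calibration idea can be sketched on a one-dimensional signal. The example below is an illustration, not the preprint's code: the "donut" mean filter is a simple choice of denoiser whose output at position i never reads the noisy value at i, and the sine wave with Gaussian noise is toy data. With zero-mean noise that is independent across positions, the error measured against the noisy data equals, in expectation, the true error plus a constant, so both rank hyperparameters identically.

```python
import math
import random


def donut_mean(x, radius):
    """Mean of the neighbours within `radius`, excluding the centre value.

    Leaving out x[i] when predicting position i is what makes the filter
    J-invariant (here each J is a single position).
    """
    n = len(x)
    out = []
    for i in range(n):
        neighbours = [
            x[j]
            for j in range(max(0, i - radius), min(n, i + radius + 1))
            if j != i
        ]
        out.append(sum(neighbours) / len(neighbours))
    return out


def self_supervised_loss(x, radius):
    """Mean squared difference between the denoised output and the noisy input."""
    denoised = donut_mean(x, radius)
    return sum((d - v) ** 2 for d, v in zip(denoised, x)) / len(x)


def calibrate(x, radii):
    """Pick the radius with the lowest loss, using noisy data only."""
    return min(radii, key=lambda r: self_supervised_loss(x, r))


rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * i / 50.0) for i in range(200)]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
candidate_radii = [1, 2, 3, 5, 8, 12]
best_radius = calibrate(noisy, candidate_radii)
```

An ordinary mean filter that includes the centre value would not work here: the loss against the noisy input would always favour the smallest window, which simply reproduces the noise.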
Lineage tracing on transcriptional landscapes links state to fate during differentiation
Abstract:A challenge in stem cell biology is to associate molecular differences among progenitor cells with their capacity to generate mature cell types. Dynamic inference from static snapshots provides some insight, but there are fundamental limits on how well dynamics can be inferred from single-cell transcriptomes alone. Here, we use expressed DNA barcodes to clonally trace single cells during differentiation and apply this approach to the study of hematopoiesis. Our analysis identifies functional boundaries of cell potential early in the hematopoietic hierarchy and locates them on a continuous transcriptional landscape. We use our approach to benchmark methods of dynamic inference from single-cell snapshots, and provide evidence of strong early fate biases dependent on cellular properties hidden from single-cell RNA sequencing.
Primer: Dynamic inference from single-cell snapshots
Abstract:Snapshots of single-cell gene expression at a single moment in time encode information about cell state dynamics. But there are challenges to inference: multiple dynamic processes could give rise to the same static snapshot, and the sparsity and high dimensionality of single-cell data make calculations difficult. Using the principle of population balance, we explore the different sources of ambiguity that limit inference from static snapshots, describe the conditions under which dynamics can be determined uniquely, and present an inference algorithm that can calculate these dynamics for sparse high-dimensional data based on spectral graph theory. A key lesson from this approach is that there exists a correspondence between graph-based inference algorithms and models of cell dynamics, which emerges from the correspondence between graph Laplacians and differential operators.
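The correspondence between graph Laplacians and differential operators can be seen in the simplest case, a path graph on evenly spaced points. This is a generic textbook illustration, not the inference algorithm from the primer: applying the Laplacian L = D - A to samples of a function gives, at interior nodes, minus the squared spacing times a finite-difference estimate of the second derivative.

```python
def path_graph_laplacian(n):
    """Laplacian L = D - A of a path graph with n nodes and unit edge weights."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def second_derivative_estimate(values, spacing):
    """Estimate f'' at interior nodes as -(L f) / spacing^2."""
    lap_f = mat_vec(path_graph_laplacian(len(values)), values)
    return [-lap_f[i] / spacing ** 2 for i in range(1, len(values) - 1)]


spacing = 0.1
xs = [i * spacing for i in range(11)]
estimates = second_derivative_estimate([x * x for x in xs], spacing)
```

For f(x) = x^2 the estimate is 2 at every interior node, up to rounding. On a cell-cell neighbour graph the same operator plays the role of a diffusion term, which is the intuition behind reading graph-based algorithms as models of dynamics.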
Mechanisms for generalized learning across tasks and environments
Abstract:Current approaches to machine learning may often involve tuning an algorithm to perform well on a specific task, and as such do not represent a general method for learning that could be valuable across many different scenarios. This talk will cover a range of techniques for addressing this problem, including multi-task learning, transfer learning, intrinsic motivation in reinforcement learning, and learning from human preferences. We show how multi-task learning can be used to account for a large degree of heterogeneity between individuals and improve performance in predicting mental health outcomes. Transfer learning can be used to combine training on data with reinforcement learning, to both reduce catastrophic forgetting and improve drug discovery algorithms. Finally, I argue that social learning is an important intrinsic motivator, and show how it can be used in both multi-agent systems and to learn from implicit human preferences.
Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity
Abstract: A formidable challenge in designing sequential treatments is to determine when and in which context it is best to deliver treatments. Consider treatment for individuals struggling with chronic health conditions. Operationally designing the sequential treatments involves the construction of decision rules that input current context of an individual and output a recommended treatment. That is, the treatment is adapted to the individual's context; the context may include current health status, current level of social support and current level of adherence for example. Data sets on individuals with records of time-varying context and treatment delivery can be used to inform the construction of the decision rules. There is much interest in personalizing the decision rules, particularly in real time as the individual experiences sequences of treatment. Here we discuss our work to design a reinforcement learning algorithm for use in optimizing physical activity using mobile health.
May 22, 2019
Primer: Generative models from NLP for sequence data
Abstract:Generative models are powerful tools for capturing functional constraints within families of biological sequences. Autoregressive models, developed in natural language processing and related fields, provide a useful approach to modeling sequence data without imposing a rigid alignment structure on the data. In this primer, we will review the math and intuition behind these models, survey advancements in model parameterization, and compare strategies for sampling from the models to generate new sequences. Finally, we will discuss important considerations when applying these models to biological data.
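To make the autoregressive factorization concrete, the sketch below writes the probability of a sequence as a product of next-symbol probabilities and generates new sequences one symbol at a time. It is a deliberately tiny stand-in: a first-order, count-based model with pseudocounts replaces the neural parameterizations the primer surveys, and the function names and toy sequences are invented for illustration. Note that the training sequences have different lengths and are never aligned.

```python
import math
import random
from collections import defaultdict

START, END = "^", "$"


def fit_bigram(sequences, alphabet, pseudocount=1.0):
    """Estimate p(next symbol | previous symbol) from unaligned sequences."""
    symbols = list(alphabet) + [END]
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        prev = START
        for ch in list(seq) + [END]:
            counts[prev][ch] += 1.0
            prev = ch
    model = {}
    for prev in [START] + list(alphabet):
        total = sum(counts[prev][s] for s in symbols) + pseudocount * len(symbols)
        model[prev] = {s: (counts[prev][s] + pseudocount) / total for s in symbols}
    return model


def log_likelihood(model, seq):
    """Chain rule: sum of log p(x_i | x_{i-1}), including the end symbol."""
    total = 0.0
    prev = START
    for ch in list(seq) + [END]:
        total += math.log(model[prev][ch])
        prev = ch
    return total


def sample(model, rng, max_len=50):
    """Ancestral sampling: draw one symbol at a time until END is drawn."""
    out = []
    prev = START
    while len(out) < max_len:
        symbols = list(model[prev])
        weights = [model[prev][s] for s in symbols]
        ch = rng.choices(symbols, weights=weights)[0]
        if ch == END:
            break
        out.append(ch)
        prev = ch
    return "".join(out)


model = fit_bigram(["ACAC", "ACACAC", "AC"], alphabet="AC")
generated = sample(model, random.Random(0))
```

Because the end symbol is itself predicted, sequence length is part of the model rather than fixed in advance. That is the property that removes the need for an alignment.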
Alignment-free models for protein and antibody design
Abstract: I will describe a set of machine learning methods for protein design with a focus on accelerating antibody discovery for specificity and affinity. I will also motivate these methods for other applications more broadly in genomics.
Antibodies and nanobodies are highly valued molecular tools, used in research for isolating and imaging specific proteins, and in medical applications as therapeutics. However, for a large number of human and model-organism proteins, existing antibodies are non-existent or unreliable. Emerging experimental techniques enable orders-of-magnitude improvement in the number of sequences assayed for target affinity but are notoriously non-specific and not always well-folded. We have explored the use of generative probabilistic models for this design challenge. We found that the high heterogeneity in antibody sequence length poses a fundamental problem for existing methods, and instead exploited model architectures from natural language processing to develop "alignment-free" predictions. We also developed strategies for designing highly diversified libraries based on these models. Finally, we trained these models not just on assayed sequences but on standing evolutionary diversity, taking full advantage of the experiments already done by nature. Small pilot studies were successful, and we have now generated a library with hundreds of thousands of sequences, which is being evaluated experimentally by our collaborators.
Transcriptomic modeling of chemotherapy side effects using human iPSC-derived cardiomyocytes
Abstract: Human iPSC-derived somatic cells provide a powerful, renewable and reproducible tool for modeling cellular responses to external perturbation in vitro, especially for non-blood cell-types such as cardiomyocytes which are extremely challenging to collect and even then are typically only available post-mortem. We investigate using a panel of such cell lines to understand the genetic basis of interindividual differences in response to a specific chemotherapy drug, doxorubicin. Anthracycline-induced cardiotoxicity (ACT) is a key limiting factor in setting optimal chemotherapy regimes, with almost half of patients expected to develop congestive heart failure given high doses. However, the genetic basis of sensitivity to anthracyclines remains unclear. We created a panel of human iPSC-derived cardiomyocytes from 45 individuals and performed RNA-seq after 24h exposure to varying doxorubicin dosages. The transcriptomic response is substantial: the majority of genes are differentially expressed and over 6000 genes show evidence of differential splicing, the latter driven by reduced splicing fidelity in the presence of doxorubicin. We show that inter-individual variation in transcriptional response is predictive of in vitro cell damage, which in turn is associated with in vivo ACT risk. We developed an efficient linear mixed model, suez, which detects 447 response-expression quantitative trait loci (QTLs). Combining suez with our RNA splicing quantification algorithm LeafCutter we find 42 response-splicing QTLs. These molecular response QTLs are enriched for lower p-values in ACT genome-wide association and enable prediction of cellular damage, supporting the in vivo relevance of our map of genetic regulation of cellular response to anthracyclines.
UPenn; previously MIT Math, CSAIL, Synthetic Neurobiology
Mapping the brain with machine learning
Abstract: Neuroscientists have long sought a "connectome" - a wiring diagram for the brain. The complexity and sheer number of neurons make it impractical to map a brain by hand, but an automated approach is increasingly possible. In this talk, I will present an overview of algorithms for tracing individual neurons and their connectivity from microscope images of brain tissue. In particular, I will discuss deep learning methods for image segmentation, the kinds of errors such algorithms make, and approaches for fixing these errors without human intervention.
Primer: Why is deep learning so deep?
Abstract: Deep neural networks are more powerful than shallow ones, but can be harder to train. In this primer, we will show mathematically why both of these statements are true. Specifically, we will see that depth leads to an exponentially greater ability to express even simple polynomial functions. We will identify why some initializations and architectures impede learning in deeper networks, and demonstrate (both theoretically and empirically) several principles to bear in mind when designing a deep neural network that will learn effectively.
Abstract: In this talk I’ll describe a novel approach called adaptive sampling that yields algorithms whose parallel running time is exponentially faster for a broad range of machine learning applications. The algorithms are designed for submodular function maximization which is the algorithmic engine behind applications such as clustering, network analysis, feature selection, Bayesian inference, ranking, speech and document summarization, recommendation systems, hyperparameter tuning, and many others. Since applications of submodular functions are ubiquitous across machine learning and data sets become larger, there is consistent demand for accelerating submodular maximization algorithms. In this talk I’ll introduce the adaptive sampling framework we recently developed and present experimental results from computational biology as well as other application domains.
Primer: Submodular maximization and machine learning
Abstract: Submodular functions are ubiquitous in machine learning, with applications ranging from item recommendation systems to feature selection, clustering, network analysis, document summarization, and Bayesian optimization. These functions capture a key property that is common to many problems: we experience diminishing returns as we select additional items (or features, clusters, nodes, keyphrases, etc.). In this talk, we will survey submodular functions in a variety of salient applications and show why maximizing such functions is challenging. We will then describe simple approximation algorithms with provably optimal guarantees and glimpse into cutting edge research.
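For readers who want a concrete picture of diminishing returns, here is a minimal sketch of the classic greedy rule applied to set coverage, a standard monotone submodular function. Whether the primer uses this exact example is an assumption; the toy sets and the names `coverage` and `greedy_max_coverage` are hypothetical. For monotone submodular functions under a cardinality constraint, this greedy rule is known to achieve a (1 - 1/e) approximation.

```python
def coverage(selected, sets):
    """Submodular objective: number of distinct elements covered."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time adding the one with the largest marginal gain."""
    selected = []
    for _ in range(k):
        current = coverage(selected, sets)
        best, best_gain = None, 0
        for name in sorted(sets):  # sorted for deterministic tie-breaking
            if name in selected:
                continue
            gain = coverage(selected + [name], sets) - current
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining set adds anything new
            break
        selected.append(best)
    return selected


# Hypothetical toy instance: each "item" covers a few elements.
toy_sets = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
chosen = greedy_max_coverage(toy_sets, k=2)
print(chosen, coverage(chosen, toy_sets))
```

In this toy instance the marginal gain of "b" drops from 3 (when nothing is selected) to 1 once "a" has been chosen, which is the diminishing-returns property in miniature.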
Abstract: Although open sharing of genomic or pharmacological data would greatly advance science, it is generally not viable due to data privacy and intellectual property concerns. Building upon modern cryptographic tools, we introduce privacy-preserving computational protocols that could encourage data sharing and collaboration in biomedicine. First, we describe the first scalable and secure protocol for large-scale genome-wide association analysis that facilitates quality control and population stratification correction while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. Second, we introduce a protocol for securely training a neural network model of drug-target interaction (DTI) that ensures the confidentiality of all underlying drugs, targets, and observed interactions. Our protocol scales to a real dataset of more than a million interactions, and is more accurate than state-of-the-art DTI prediction methods. Using our protocol, we discover novel DTIs that we experimentally validated via targeted assays. Our work lays a foundation for more effective and cooperative biomedical research.
Manifold learning yields insight into cellular state space under complex experimental conditions
Abstract: Recent advances in single-cell technologies enable deep insights into cellular development, gene regulation, cell fate and phenotypic diversity. While these technologies hold great potential for improving our understanding of cellular state space, they also pose new challenges in terms of scale, complexity, noise, and measurement artifacts, which require advanced mathematical and algorithmic tools to extract underlying biological signals. Further, as experimental designs become more complex, there are multiple samples (patients) or conditions under which single-cell RNA sequencing datasets are generated and must be batch corrected and the corresponding populations of single cells compared. In this talk, I cover one of the most promising techniques to tackle these problems: manifold learning. Manifold learning provides a powerful structure for algorithmic approaches to denoise the data, visualize the data and understand progressions, clusters and other regulatory patterns, as well as correct for batch effects to unify data. I will cover two alternative approaches to manifold learning, graph signal processing (GSP) and deep learning (DL), and show results in several projects including: 1) MAGIC (Markov Affinity-based Graph Imputation of Cells): an algorithm that low-pass filters data after learning a data graph, for denoising and transcript recovery of single cells, validated on HMLE breast cancer cells undergoing an epithelial-to-mesenchymal transition. 2) PHATE (Potential of Heat-diffusion Affinity-based Transition Embedding): a visualization technique that offers an alternative to tSNE in that it preserves local and global structures, clusters as well as progressions using an information-theoretic distance between diffusion probabilities. 3) MELD (Manifold-enhancement of latent dimensions): an analysis technique that filters the experimental label on the graph learned from single-cell data in order to boost experimental signal and associated correlations.
4) SAUCIE (Sparse AutoEncoders for Clustering Imputation and Embedding), our highly scalable neural network architecture that simultaneously performs denoising, batch normalization, clustering and visualization via custom regularizations on different hidden layers. We demonstrate the power of SAUCIE on a massive single-cell dataset consisting of 180 samples of PBMCs from Dengue patients, with a total of 20 million cells. We find that SAUCIE performs all the above tasks efficiently and can further be used for stratifying patients themselves on the basis of their single cell populations. Finally, I will preview ongoing work in neural network architectures for predicting dynamics and other biological tasks.
Primer: Manifold learning and graph signal processing of high-dimensional, high-throughput biological data
Abstract: The primer will go over graph and graph-diffusion based methods for manifold learning including diffusion maps and our new method PHATE (potential of heat-diffusion affinity-based transition embedding). We will also introduce graph signal processing and the general concept of treating measurements as signals on a cell-cell graph. We will show the utility of this view in our techniques such as MAGIC (Markov affinity-based graph imputation of cells) for data denoising and imputation, and MELD (manifold-enhancement of latent dimensions) for enhancing latent experimental signals and performing causal inference on drivers of experimental differences.
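The idea of treating measurements as signals on a cell-cell graph can be sketched in a few lines. The following is an illustrative toy, not the published MAGIC implementation: it assumes a Gaussian affinity on Euclidean distances, row-normalizes it into a Markov matrix, and applies that matrix `t` times to the data, which acts as a low-pass filter. The function name `diffuse`, the kernel width `sigma`, and the toy cells are all hypothetical.

```python
import math


def diffuse(data, sigma=1.0, t=2):
    """Smooth each gene over a cell-cell graph by t steps of Markov diffusion.

    data: list of cells, each a list of gene values.
    """
    n = len(data)
    n_genes = len(data[0])
    # Gaussian affinity between every pair of cells (an assumed kernel choice).
    affinity = [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
                / (2.0 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Row-normalize so each row is a probability distribution over neighbors.
    markov = [[a / sum(row) for a in row] for row in affinity]
    smoothed = [list(cell) for cell in data]
    for _ in range(t):
        smoothed = [
            [sum(markov[i][j] * smoothed[j][g] for j in range(n)) for g in range(n_genes)]
            for i in range(n)
        ]
    return smoothed


# Hypothetical toy data: cell 0 resembles cells 1 and 2 but has a "dropout" in gene 1.
cells = [[5.0, 0.0], [5.0, 4.0], [5.2, 4.1], [0.0, 0.1], [0.1, 0.0]]
smoothed = diffuse(cells, sigma=2.0, t=2)
print(smoothed[0])
```

Because every row of the Markov matrix sums to one, each smoothed value is a weighted average of the observed values, so the dropout in cell 0 is pulled toward the values of its nearest neighbors.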
October 24, 2018
Regev and Lander labs, Broad
Studying cell and tissue physiology with random composite experiments
Abstract: In this talk I describe how compressed sensing can be applied to greatly accelerate the study of cell physiology. Specifically, it has the potential to transform work in two distinct fields, histology and genetics, using a common approach: composite experiments. With these approaches, methods that can today image 10-100 genes in a sample will be leveraged to measure thousands of genes, and genetic perturbation studies can be scaled by 1,000-10,000 fold. I will describe both the theoretical underpinnings of these approaches, as well as the status of ongoing efforts to implement composite experiments in the lab.
Lightning Talk Social
Miriam Shiffman, Broderick Lab, MIT CSB and Regev Lab, Broad: We develop a full generative model and inference algorithm for reconstructing probabilistic trees of cellular differentiation from single-cell RNA-seq data. A central innovation is the development of a new class of Bayesian tree models for data that arise from continuous evolution along a latent nonparametric tree.
Miriam Udler, Florez Lab, MGH/HMS: Complex diseases like type 2 diabetes (T2D) are thought to be caused by multiple contributing genetic and environmental processes. We have recently identified five key genetic pathways impacting T2D risk, and I am interested in whether we can use these pathways along with other clinical data to improve the classification (and ultimately management) of patients with T2D.
Brian Trippe, Broderick Lab, MIT CSB: Generalized linear models and Bayesian inference provide a powerful toolkit for building interpretable models with coherent quantification of uncertainty, but are often computationally expensive to use on high-dimensional datasets. We present an approximation method which enables more efficient, accurate inference with theoretical guarantees on quality.
Eli Weinstein, Marks Lab, Harvard Biophysics: The massive increase in genetic sequence data from diverse, uncultured microorganisms offers opportunities for the discovery of novel and useful molecular systems. I'll describe computational methods for finding genetic loci that are modular or programmable; our approach does not depend on identifying homology to previously characterized systems, relying instead on inference of sequence models and statistical tests for diversity.
Engelhardt Lab, Princeton CS and Quantitative and Comp. Bio.
Experimental design for maximizing cell type discovery in single-cell RNA-seq data
Abstract: Bandit algorithms are often the tool of choice for recommendation engines, and have recently seen applications in the context of medical health care data. Here, inspired by bandit ideas, we show a novel application to iterative experimental design in multi-tissue single-cell RNA-seq (scRNA-seq) data. We present two algorithms, a Good-Toulmin like estimator via Thompson sampling (joint work with Karen Feng and Barbara Engelhardt) and an extension involving a Pitman-Yor prior (joint work with Federico Ferrari and Stefano Favaro). Given a budget and modeling cell type information across tissues, they both estimate how many cells are required for sampling from each tissue with the goal of maximizing cell type discovery across samples from multiple iterations. In both real and simulated data, we demonstrate the advantages these algorithms provide in data collection planning when compared to a random strategy in the absence of experimental design.
Engelhardt Lab, Princeton CS and Quantitative and Comp. Bio.
Primer: Robust nonlinear manifold learning for single cell RNA-seq data
Abstract: Analysis of single cell RNA sequencing (scRNA-seq) experiments requires dimension reduction for regularization and efficiency. We present a nonlinear latent variable model with robust, heavy-tail error modeling and adaptive kernel learning to capture low dimensional nonlinear structure in scRNA-seq data. Gene expression is modeled as a noisy draw from a Gaussian process in high dimensions from latent positions, known as a Gaussian Process Latent Variable Model (GPLVM). We model residual errors with a heavy-tailed Student's t-distribution to control for observed technical and biological noise. We compare our approach to common dimension reduction tools to highlight our model's ability to enable important downstream tasks, including clustering and inferring cell developmental trajectories, on available experimental data. We show that our robust nonlinear manifold is well suited for raw, unfiltered gene counts from high throughput sequencing technologies for visualization and exploration of cell states.
Topic modeling the transcriptional spectrum in innate lymphoid cells
Abstract: Analyses of immune cell classes, such as innate lymphoid cells (ILCs) or T helper cells, typically treat them as collections of discrete immune cell “types”. Yet, these cell types may share important biological signals and have been observed in some contexts to essentially continuously span a functional spectrum. In single-cell RNA-seq data from skin-resident ILCs, we observed a multi-dimensional spectrum of ILCs that was shifted and functionally reconfigured by induction of psoriasis. To capture and explore these fluid, mixed transcriptional states, we used topic modeling by latent Dirichlet allocation (LDA), a method (covered in the great primer David will give!) designed to analyze the words in a corpus of text documents to discover the themes, or topics, that pervade them. Through an analogy between document analysis and single-cell analysis, we applied LDA to discover each cell’s multiple, non-hierarchical “identities”, and their relative importance, and used these features to analyze cellular plasticity during inflammatory response. Topic weights captured relationships not well described by clusters and, through their functional interpretation, enabled a more nuanced view of similarities among cells. There was no apparent “pseudo-time axis” of progression across steady-state cell states, but a temporal “induction” dimension in our data was revealed when we focused on specific topics related to immune repression or activation. Using experimental techniques in a mouse model, we validated several computational predictions, including the previously undescribed presence of quiescent-like tissue-resident ILCs and differentiation of activated skin-resident ILC2s into pathological ILC3s. Approaches like topic modeling should be valuable in representing other continuous cell states and in uncovering dynamic cellular activation in response to a stimulus.
Data Sciences Platform, Broad
Primer: Intro to topic models
Abstract: Starting from a ridiculously simple language model we will build up the prototypical topic model, Latent Dirichlet Allocation (LDA), piece by piece. We will discuss why LDA works and ways to elaborate upon it. Finally, we will survey applications of LDA in biology.
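As a companion to the piece-by-piece construction, here is a small sketch of the standard LDA generative story: each topic is a distribution over words, each document draws its own mixture over topics, and each word is produced by first picking a topic and then picking a word from it. The function names, the hyperparameter values, and the toy vocabulary of gene names are hypothetical choices for illustration; in the single-cell analogy, documents play the role of cells and words the role of transcripts.

```python
import random


def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, beta=0.5, seed=0):
    """Sample documents from the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs, mixtures = [], []
    for _ in range(n_docs):
        theta = dirichlet([alpha] * n_topics, rng)  # this document's topic mixture
        words = []
        for _ in range(doc_len):
            z = categorical(theta, rng)      # choose a topic for this word
            w = categorical(topics[z], rng)  # choose a word from that topic
            words.append(vocab[w])
        docs.append(words)
        mixtures.append(theta)
    return docs, mixtures, topics


# Hypothetical vocabulary; any list of tokens would do.
vocab = ["Gata3", "Rorc", "Il5", "Il17a", "Il22", "Klf2"]
docs, mixtures, topics = generate_corpus(n_docs=3, doc_len=8, vocab=vocab, n_topics=2)
for words, theta in zip(docs, mixtures):
    print([round(p, 2) for p in theta], words)
```

Inference in LDA runs this story in reverse: given only the documents, it estimates the topic distributions and each document's mixture.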
Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq
Abstract: While matrix factorizations such as PCA or ICA are commonly used for dimensionality reduction of single-cell RNA-Seq data, the dimensions they infer may not necessarily align with biologically meaningful gene expression programs and are frequently ignored in practice. Here, I will discuss analysis of real and simulated single-cell data showing that matrix factorization can yield components corresponding to cell types and cellular activities such as life-cycle processes or responses to environmental stimuli. However, one limitation of many matrix factorizations is that their stochastic optimization algorithms can yield variable solutions when run multiple times on the same dataset which reduces the interpretability of the result. To address this limitation, we developed a meta-analysis approach that we call consensus matrix factorization which averages over multiple replicates to increase the robustness of the solution. We show with simulated data that, in particular, the consensus implementation of NMF (cNMF) outperforms several other factorizations at inferring cell-type and activity programs, including the relative contribution of programs in each cell. Applied to published brain organoid and visual cortex single-cell RNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g. cell-cycle and hypoxia) and intriguing novel activity programs. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.
Primer: Intro to non-negative matrix factorization
Abstract: Dimensionality reduction is essential for extracting generalizable knowledge from noisy, high-dimensional data. While singular value decomposition (SVD, PCA) is optimal with respect to minimizing data movement, the resulting features are often not interpretable or robust across experiments. Non-negative matrix factorization (NMF) is a powerful alternative that may be applied when the data is non-negative (e.g., counts or concentrations of biological molecules!). In this primer, we will formulate an NMF objective function and optimization algorithm, paying special attention to practical challenges that Dylan will explore in the main talk. We will discuss biomedical applications of NMF, including to spatially-resolved RNA-seq data. And, time permitting, we will survey familiar probabilistic models built on NMF, such as the topic models from last week.
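One common way to make "an NMF objective function and optimization algorithm" concrete is the squared-error objective ||X - WH||^2 with W, H >= 0, minimized by the multiplicative update rules of Lee and Seung. The sketch below assumes that formulation; the primer may well use a different objective or solver. The helper names, the toy matrix, and the small constant `eps` (added to avoid division by zero) are hypothetical.

```python
import random


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Squared Frobenius norm of X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Hypothetical toy data: a 4 x 4 non-negative matrix of exact rank 2.
X = matmul([[1, 0], [0, 1], [1, 1], [2, 1]], [[1, 2, 0, 1], [0, 1, 3, 1]])
W, H = nmf(X, k=2)
print(frobenius_error(X, W, H))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. Different seeds can converge to different factorizations, which is one of the practical challenges that motivates running several replicates.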
MGH, HMS, Broad & MIT Bioengineering; Pinello, Joung, Collins Labs
A deconvolution framework for the analysis of CRISPR tiling screen data
Abstract: The advent of programmable genome editing using CRISPR-based technologies has allowed for high-throughput functional interrogation of non-coding elements throughout the genome. Functional mapping can be achieved by densely tiling single guide RNAs (sgRNAs) across a non-coding region of interest, where each sgRNA enables linking of a unique genomic location to an observable phenotype. Here we present CRISPR-SURF, a generalizable deconvolution framework, to discover and dissect non-coding regulatory elements from the analysis of CRISPR tiling screen data. Luca Pinello will open the talk by motivating why people are excited about CRISPR tiling screens and describing the key ideas and challenges. Jonathan Hsu will dive into the details of the proposed deconvolution framework - the method at the heart of CRISPR-SURF - and discuss an efficient implementation for it. Finally, we will discuss future directions for the use of CRISPR-Cas tiling screens.
Single-cell trajectory reconstruction, exploration and mapping from omics data
Abstract: Single-cell transcriptomic assays have enabled the de novo reconstruction of lineage differentiation trajectories, along with the characterization of cellular heterogeneity and state transitions. Several methods have been developed for reconstructing developmental trajectories from single-cell transcriptomic data, but efforts on analyzing single-cell epigenomic data and on trajectory visualization remain limited. Here we present STREAM, an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. First, Luca Pinello will set the stage presenting the basic concepts of how to build a trajectory inference approach from scratch (a cookbook perspective). Then Huidong Chen will describe the method behind STREAM - a novel Elastic Principal Graph implementation (ElPiGraph), followed by a detailed discussion of how to visualize the learned trajectory and how to discover branch-specific genes, or genes differentiating between trajectory branches. We will close off with examining what we have learned so far and what the future directions and challenges are.
Using Knockoffs to Find Important Variables with Statistical Guarantees
Abstract: Despite the significant recent progress in high-dimensional variable selection (reviewed in the primer), it remains unclear how to powerfully select important variables while controlling the fraction of false discoveries, even in simple models like logistic regression, not to mention general high-dimensional nonlinear models. To address this practical problem, we propose a new framework of model-X knockoffs, which acts as a wrapper around any (arbitrarily complex, e.g., drawn from machine learning) measure of variable importance and identifies important variables while exactly controlling the false discovery rate. Our method relies only on a model for the explanatory variables X, and in fact makes no assumptions at all about the response variable's distribution. To our knowledge, no other procedure solves the FDR-controlled variable selection problem in such generality, but in the restricted settings where competitors exist we demonstrate the superior power of knockoffs through simulations. We also demonstrate model-X knockoffs on GWAS data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
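The selection step of the knockoff filter is simple enough to sketch. Assume each variable j already has a statistic W_j that tends to be large and positive when the real variable looks more important than its knockoff copy, and is symmetric around zero for null variables. The "knockoff+" rule then picks the smallest threshold t with (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q and selects every variable with W_j >= t. Constructing the knockoffs and the W statistics is the hard part and is not shown; the toy values below are made up for illustration, and the function names are hypothetical.

```python
def knockoff_threshold(W, q, offset=1):
    """Smallest t whose estimated false discovery proportion is at most q.

    offset=1 gives the conservative "knockoff+" rule.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)  # proxy for false positives
        positives = sum(1 for w in W if w >= t)   # candidate discoveries
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold qualifies: select nothing


def knockoff_select(W, q):
    """Indices of variables selected at target FDR level q."""
    t = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Made-up statistics: six strong positive signals, four small values near zero.
W_toy = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
print(knockoff_threshold(W_toy, q=0.2), knockoff_select(W_toy, q=0.2))
```

The negative statistics act as a built-in estimate of how many nulls are hiding among the positives, which is what lets the procedure work as a wrapper around any importance measure.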
Primer: Challenges in High-Dimensional Variable Selection
Abstract: Identifying relevant features to explain a response variable has always been an important problem in many areas of science. As data sets become more complex, the number of candidate features is quickly growing and very often even exceeds the number of observations we can afford to collect. This brings huge challenges for statisticians and scientists, as traditional variable selection methods fail in these cases. This talk reviews these challenges and existing statistical methods to address them. We will discuss the advantages and disadvantages of those methods, ultimately motivating the novel approach presented in the main talk.
Spring 2018 Schedule: 9:00am Primer, 9:50am Breakfast, 10:00am Seminar, 11:00am Discussion; all in Monadnock
Institute for Medical Sciences and Engineering and Department of Civil and Environmental Engineering, MIT
Rapid bacterial adaptation within individual human microbiomes
Abstract: We tend to think about human microbiomes as communities of static species. The degree to which individual commensal species functionally change within individual people has remained elusive, as it is difficult to identify de novo mutations from metagenomic data of mixed communities. We have recently discovered that commensal members of our microbiota acquire de novo mutations with strong fitness consequences within individual people. In this talk, I will discuss the challenges with identifying within-person evolution from metagenomics alone, describe how culture-dependent methods enable powerful evolutionary inferences, and touch on the implications of an evolving microbiome for data interpretation and therapy.
Abstract: Recent success stories of using machine learning for diagnosing skin cancer, diabetic retinopathy, pneumonia, and breast cancer may give the impression that artificial intelligence (AI) is on the cusp of radically changing all aspects of health care. However, many of the most important problems, such as predicting disease progression, personalizing treatment to the individual, drug discovery, and finding optimal treatment policies, all require a fundamentally different way of thinking. Specifically, these problems require a focus on *causality* rather than simply prediction. Motivated by these challenges, my lab has been developing several new approaches for causal inference from observational data. In this talk, I describe our recent work on the deep Markov model (Krishnan, Shalit, Sontag AAAI '17) and TARNet (Shalit, Johansson, Sontag, ICML '17).
Abstract: A common goal in science is to use knowledge gained by observing a phenomenon of interest to guide decision and policy making. If smokers are observed to have higher rates of lung cancer, should we legislate to discourage smoking? Such a policy will only be effective if smoking itself is the cause of cancer and the correlation between cancer rate and smoking is not explained by other factors, such as lifestyle choices. Problems like these are well described in the language of causal inference. In this primer, we explain the difference between statistical and causal reasoning, and introduce the notions of confounding, causal graphs and counterfactuals. We cover the problem of estimating causal effects from experimental and observational data, as well as sufficient assumptions to make causal statements based on statistical quantities.
Brigham and Women's Hospital, Harvard Medical School; Broad Institute
Leveraging long range phasing to detect mosaicism in blood at ultra-low allelic fractions
Abstract: What would you do with 500,000 near-perfectly phased genomes? One answer (we welcome others!): harness this information to detect subtle imbalances between maternal and paternal allelic fractions in blood DNA -- the hallmark of clonal mosaic chromosomal alterations [1,2]. In this talk, we will describe how we phased the UK Biobank to chromosome-scale accuracy [3,4], developed HMM-based machinery to sensitively call mosaic alterations, and probed the data to reveal new insights into the causes and consequences of clonal hematopoiesis. 
Linking gut microbiomes, genomes and phenotypes via linear mixed models and kernel methods
Abstract: The gut microbiome is increasingly recognized as having fundamental roles in human physiology and health, and is often referred to as our second genome. However, the associations between microbiome, our genome, our environment and our health are not well understood. I will discuss our recent work to elucidate these relations, using a cohort of ~1,000 Israeli individuals with detailed microbiome, genotype, clinical and environmental measurements, with an emphasis on methods to handle the large dimensionality and heterogeneity of such data. Our approaches combine linear mixed models – the statistical backbone of GWAS and phenotype prediction methods – with common techniques from statistical ecology, and with kernel regression approaches from machine learning.
In the first part of the talk, I will describe approaches to investigate the role of host genetics in shaping the gut microbiome. In the second part, I will describe approaches to investigate how host genetics and the microbiome interact with traits such as obesity and glucose levels. I will show that the fraction of phenotypic variance explained by the microbiome is often comparable to that of host genetics, which provides a positive outlook towards microbiome-based therapeutics of metabolic disorders.
This is joint work with Daphna Rothschild and Elad Barkan from Eran Segal's group at the Weizmann Institute of Science. It has recently been accepted for publication in Nature. [preprint]
Abstract: We have a variety of linear methods for data analysis and machine learning that are familiar & intuitive, but our data are often nonlinear in complicated ways, or come in a form where the idea of "linear" doesn't have an obvious meaning, such as DNA sequences or graphs based on protein interactions.
Kernel methods allow us to apply some of our familiar linear tools to nonlinear and structured data, using similarities between data points as the basis for classification, regression, and other analyses like PCA. I'll explain the "kernel trick" as a principled way to extend linear methods to work with similarities, talk about algorithms based on kernels (support vector machines, support vector regression, & kernelized PCA), introduce example kernels for a variety of data types (e.g., vectors, graphs, strings), and discuss approximations that allow kernels to be applied to very large datasets.
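A tiny numerical check can make the "kernel trick" less mysterious. For two-dimensional vectors, the degree-2 polynomial kernel k(x, y) = (x · y)^2 equals an ordinary inner product after mapping each point to the explicit features (x1^2, sqrt(2)·x1·x2, x2^2). The sketch below verifies this on one pair of points; the function names and the example vectors are hypothetical, and the choice of this particular kernel is an assumption made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: similarity computed directly."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel (2-D inputs)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, y)
via_features = dot(poly2_features(x), poly2_features(y))
print(via_kernel, via_features)
```

Any linear method that touches the data only through inner products can therefore be run in the richer feature space by swapping in the kernel, without ever building the features. For strings or graphs, where an explicit feature vector may be enormous, this is what makes the approach practical.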
Inferring Microbial Phenotypes through Latent Representations of Biological Diversity
Abstract: Creating orderly representations of the vast diversity of microbial life is an ancient problem. The spread of mass sequencing has revealed the inability of the Linnaean system both to identify organisms and to capture the variation between them. Marker sequences, such as 16S rRNA in Bacteria and ITS in Fungi, have been used to identify taxa, albeit crudely at times. Identification with a sequence allows for the explicit modeling of distance between organisms and ordination of the resulting phylogenetic distance space. We study the use of a common topic model, LDA (Latent Dirichlet Allocation), to capture the differences between sequences. We show that distance in the latent space of topics reproduces alignment distance between closely related taxa. Additionally, we find that the dimensions of this space reflect the hierarchy of biological relationships. This transformation allows for fast comparison of taxa, Gaussian process modeling of the properties of unsequenced strains, and phenotypic interpolation based on their neighbors. These results represent a comprehensive and extensible methodology for the modeling of biological diversity.
Emmanuel College Mathematics Department; Harvard Program for Evolutionary Dynamics
Evolutionary dynamics on any population structure
Abstract: The evolution of social behavior can be modeled using evolutionary game theory. Population structure, which can be represented as a graph or network, affects which traits evolve. Understanding evolutionary game dynamics in heterogeneously structured populations is difficult. For arbitrary selection intensity, the problem is in a computational complexity class which suggests there is no efficient algorithm. I will present recently published work that provides a solution for weak selection, which applies to any graph or social network. The method uses coalescent theory and relies on calculating the meeting times of random walks. The method is used to evaluate large numbers of diverse and heterogeneous population structures for their propensity to favor cooperation. I will demonstrate how small changes in population structure---graph surgery---affect evolutionary outcomes. It turns out that cooperation flourishes most in societies that are based on strong pairwise ties.
March 28, 2018
Data Sciences Platform
Variant Filtering and Calling with Convolutional Neural Networks
Abstract: Convolutional Neural Networks (CNNs) process the reference genome and aligned reads covering sites of genetic variation encoded as numeric tensors. Convolutions over these tensors learn to detect motifs useful for variant filtering and calling. Variant filtering models learn to classify variants as artifact or real. Variant calling models learn to segment genomic positions into the diploid genotypes. We will demonstrate how these models can integrate summary statistic information for faster training and potential applications in unsupervised learning. We will also explore several hyper-parameter optimization strategies for architecture selection. Improvements in both sensitivity and precision with respect to current state-of-the-art filtration methods like Gaussian mixture models, random forests, and DeepVariant will be presented.
April 4, 2018
Bonneau Lab, NYU
Multitask learning approaches to biological network inference: linking model estimation across diverse related datasets
Abstract: Due to the increasing availability of biological data, methods to properly integrate data generated across the globe are becoming essential for extracting reproducible insights into relevant research questions. We developed a framework to reconstruct gene regulatory networks from expression datasets that were generated in separate studies and are therefore challenging to integrate because of technical variation (different dates, handlers, laboratories, protocols, etc.). In this talk, I will introduce how we currently learn regulatory networks from gene expression data, and then how we extend our methods to learn multiple networks from related datasets jointly through multitask learning. In particular, our method aims to detect weaker patterns that are conserved across datasets, while also being able to detect dataset-unique interactions. In addition, adaptive penalties may be used to favor models that include interactions derived from multiple sources of prior knowledge, including orthogonal genomics experiments. Since underlying regulatory mechanisms are often shared across conditions and/or cohorts, we hypothesized that multitask approaches, where conclusions are drawn from various data sources, would improve performance of network inference. Using two unicellular model organisms, we show that joint network inference outperforms inference from a single dataset. Finally, we also demonstrate that our method is robust to false edges in the prior and to low condition overlap across datasets. Because of the increasing practice of data sharing in biology, we speculate that cross-study inference methods will be highly valuable in the near future, increasing our ability to learn more robust and generalizable hypotheses and concepts.
Center for Genomics and Systems Biology, New York University
Primer: Inference of biological networks with biophysically motivated methods
Abstract: Via a confluence of genomic technology and computational developments, the possibility of network inference methods that automatically learn large comprehensive models of cellular regulation is closer than ever. This talk will focus on enumerating the elements of computational strategies that, when coupled to appropriate experimental designs, can lead to accurate large-scale models of chromatin-state and transcriptional regulatory structure and dynamics. We highlight four research questions that require further investigation in order to make progress in network inference: using overall constraints on network structure like sparsity, use of informative priors and data integration to constrain individual model parameters, estimation of latent regulatory factor activity under varying cell conditions, and new methods for learning and modeling regulatory factor interactions. We conclude with examples of applying this strategy to: 1) human and mouse lymphocyte development and function and 2) inference from single-cell and spatial transcriptomics aimed at healthy and diseased brain and spinal tissues.
Learning protein structure with a differentiable simulator
Abstract: While the problem of predicting protein structure from sequence is among the oldest in computational biology, current methods leave a significant fraction of the protein universe out of reach. Standard methodology involves two steps: (1) defining an energy landscape, whether with physics, statistics, or homology, and (2) sampling low-energy conformations. Often, even "correct" energy landscapes that assign the lowest energy to the correct structure will not generate it as a prediction, because the conformational sampling algorithm cannot find it. We have been developing an alternative approach to bridge this gap by directly training energy landscapes in tandem with the conformational sampling algorithms that operate on them. I will talk about this approach, backpropagation through simulators in general, and how we built a deep neural energy function that is trained by backpropagating through the *entire* protein folding process.
Program for Evolutionary Dynamics, Harvard University
Abstract: Biological evolution describes how populations of individuals change over time. The three fundamental principles of evolution are mutation, selection and cooperation. I will present the mathematical formalism of evolution focussing on stochastic processes. I will discuss amplifiers and suppressors of natural selection, evolutionary game theory and evolutionary graph theory.
Program for Evolutionary Dynamics, Harvard University
Primer: Hamilton's rule makes no prediction and cannot be tested empirically
Abstract: Hamilton's rule is a well-known concept in evolutionary biology. It states that a social trait is favored by natural selection if BR>C, where B is the benefit for the recipient, C the cost for the donor and R the relatedness between donor and recipient. It is often perceived as a statement that makes predictions about natural selection in situations where interactions occur between genetic relatives. It turns out that this view is incorrect. A simple mathematical analysis reveals that the "exact and general" formulation of Hamilton's rule, which is widely endorsed by its proponents, is not a consequence of natural selection and not even a statement specifically about biology. Instead it is a relationship among slopes of linear regression that holds for any suitable data set. It follows that the general form of Hamilton's rule makes no predictions and cannot be tested empirically.
May 2, 2018
Shantanu Singh, Jane Hung, Juan Caicedo, Mohammad Rohban
How to make a picture worth a thousand numbers: models and methods in biological image analysis
Abstract: Our group (Carpenter lab / Imaging Platform) develops CellProfiler – a widely-used bioimage analysis software, and has pioneered image-based profiling – an approach to create signatures of cell populations using high-throughput microscopy imaging. In this talk, we will present approaches we've been developing to create a new generation of these tools and methods. This talk will include four quick vignettes:
- A deep learning-based tool and library for cell detection, applied to nucleus detection and malaria stage classification.
- An extensive evaluation of two convnet models for nucleus segmentation, as well as a sneak peek into the results of the 2018 Kaggle Data Science Bowl challenge we organized.
- A new approach for creating morphological profiles by training convnets using weak labels.
- A new approach for creating cell population profiles which capture single-cell heterogeneity.
Abstract: Data science and data engineering often involve complex pipelines, and it’s not transparent where flaws are introduced. I will discuss the (simple) idea of AI audit, where we leverage the predictive power of machine learning to systematically perform quality control of various components of the data pipeline. I will illustrate this framework with three diverse examples: integrating single-cell RNA-Seqs, designing new proteins, and word embeddings.
Abstract: Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely-used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected in different conditions, e.g. a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. We propose a new method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in applications where PCA is currently used.
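The abstract does not spell out the computation. As cPCA is commonly described, the contrastive directions are the leading eigenvectors of the target covariance minus a contrast weight times the background covariance. The sketch below illustrates that idea for the single top direction only. The function name, the `alpha` parameter name, and the use of shifted power iteration (chosen so the example needs nothing beyond the standard library) are assumptions for illustration, not the authors' implementation.

```python
import math

def top_contrastive_direction(cov_target, cov_background, alpha=1.0, iters=500):
    """Leading eigenvector of (cov_target - alpha * cov_background).

    Both inputs are symmetric covariance matrices given as lists of lists.
    alpha=0 reduces to ordinary PCA on the target data.
    Caveat: power iteration assumes a unique top eigenvalue and a start
    vector that is not orthogonal to the corresponding eigenvector.
    """
    n = len(cov_target)
    contrast = [[cov_target[i][j] - alpha * cov_background[i][j] for j in range(n)]
                for i in range(n)]
    # Gershgorin bound: adding `shift` to the diagonal makes every eigenvalue
    # non-negative without changing the eigenvectors or their ordering.
    shift = max(sum(abs(x) for x in row) for row in contrast)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(contrast[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Toy covariances (made up): axis 0 carries large variance shared by both
# datasets; axis 1 carries extra variance that appears only in the target.
cov_target = [[10.0, 0.0], [0.0, 2.0]]
cov_background = [[10.0, 0.0], [0.0, 0.5]]

pca_direction = top_contrastive_direction(cov_target, cov_background, alpha=0.0)
cpca_direction = top_contrastive_direction(cov_target, cov_background, alpha=1.0)
```

In this toy case plain PCA follows the shared high-variance axis, while the contrastive direction follows the axis whose variance is enriched in the target. In practice `alpha` is a tuning knob and several values are usually explored.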
Abstract: As the number and size of sequencing-based experiments grows, biologists and machine learning researchers can apply more complicated models to the prediction of the biological effects of variation in sequence. Interpreting these models is a challenge. I will discuss some of the methods we have applied to understanding sequence-based models, in particular, what we can learn with minimal understanding of the internals of the model itself.
Lander Lab, Broad Institute
Primer: How philosophy of science can help us better deploy machine learning in biology
Abstract: The scope of machine learning applications has increased dramatically in recent years and has captured the attention of many biological researchers. Will new efforts in machine learning only enable new predictive capabilities or could these tools contribute to new ways of observing and reasoning about biological systems? This primer explores concepts from philosophy of science that can help to orient researchers and evaluate how machine learning may add explanatory power to models of biology in addition to improving prediction.
Departments of Computer Science, University of Copenhagen and IT University of Copenhagen
Convolutional models of molecular structure
Abstract: Although originally devised in the field of image analysis, convolutional neural networks (CNNs) are increasingly finding applications outside the image domain. In particular, a number of studies over the last year have made a convincing case for the use of CNNs within the field of molecular modelling. As an example of these recent developments, we will present our work on using convolutions to predict mutation-induced changes-of-stability (ddgs) in proteins. We will demonstrate how a simple convolutional model using a purely data-driven approach achieves performance comparable to that of state-of-the-art methods in the field. Finally, we will discuss current theoretical developments in the area of convolutions, including the quest for rotational equivariance.
Departments of Computer Science, University of Copenhagen and IT University of Copenhagen
Primer: Learning from molecular structure
Abstract: Purely data-driven modelling techniques have had a fundamental impact on the analysis of biological sequences. In particular, neural networks have been used extensively, with successful applications in, for instance, the prediction of secondary structure, aggregation propensities, and disorder. In contrast, the 3D structure of molecules has been modelled almost exclusively with carefully parameterised physical force fields, which are notoriously difficult to optimise from data. Recent developments in machine learning are changing this picture, making it possible to learn structure-sequence relationships directly from raw molecular structures. In this primer, we will briefly review these developments, and introduce the concept of convolutional neural networks, which form the basis for many of the current activities, including the work we will present as our main talk.
Fall 2017 Schedule: 9:00am Primer, 9:50am Breakfast, 10:00am Seminar, 11:00am Discussion; all in Monadnock
Reading the rules of gene regulation from the human noncoding genome
Abstract: Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, thousands of noncoding loci associated with diseases and physical traits lack mechanistic explanation. I'll present a machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. I'll show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable GWAS loci fine mapping.
Broad Data Sciences Platform
Primer: Classifying genomic sequences with convolutional neural networks
Abstract: Initially developed for image processing, Convolutional Neural Networks (CNNs) have been applied to genomic data with promising results. This primer will trace some of the history of neural networks with an eye towards the practical lessons learnt along the way. Then building on the idea of the Position Weight Matrix as a motif detector we will explore exactly what convolution means when applied to a DNA sequence. While drawing examples from computer vision and natural language processing, our focus will be on the application of CNNs to genomic data. Lastly, we will cover recent advances in CNNs including residual connections and dilated convolutions.
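To make the "Position Weight Matrix as a motif detector" idea concrete, here is a minimal sketch of scanning a one-hot encoded DNA sequence with a PWM, which is the same arithmetic a single convolutional filter performs. The motif, the log-odds values, and the function names are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def scan(seq, pwm):
    """Slide the PWM along the sequence and return one score per window.

    This is a 1D 'valid' convolution without kernel flipping (strictly a
    cross-correlation, which is what most deep learning libraries compute).
    """
    x = one_hot(seq)
    width = len(pwm)
    scores = []
    for start in range(len(x) - width + 1):
        score = sum(x[start + k][c] * pwm[k][c]
                    for k in range(width) for c in range(4))
        scores.append(score)
    return scores

# Hypothetical log-odds PWM for the motif "TATA" against a uniform background:
# the matching base has probability 0.85, each other base 0.05.
match = math.log2(0.85 / 0.25)
mismatch = math.log2(0.05 / 0.25)
motif = "TATA"
pwm = [[match if b == base else mismatch for b in BASES] for base in motif]

scores = scan("GGTATAGG", pwm)
best = max(range(len(scores)), key=scores.__getitem__)  # start of best window
```

A CNN's first layer can be read as many such filters whose weights are learned from data rather than fixed in advance.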
Detecting effects of transcription factors on disease
Abstract: Learning biology using GWAS data frequently involves identifying genomic regions involved in a biological process and assessing for enrichment of GWAS signal in those regions. But in some cases, e.g., binding of a transcription factor (TF), improving models and growing data sets allow us to estimate in a signed way whether genetic variants promote or hinder a biological process. I'll present a new method, signed LD profile regression, for combining this type of information with GWAS data to draw relatively strong inferences about trait mechanism. I'll then describe how this method can be applied in conjunction with signed genomic annotations reflecting binding of ~100 TFs in various cell lines generated using a convolutional neural network, Basset. Finally, I'll discuss some results from applying our method to GWAS data about a range of traits including gene expression, epigenetic traits, and several diseases.
Abstract: Linear models are a very common choice when modeling the relation between inputs and outputs because of their simplicity and interpretability. We will explore methods for parameter estimation in these models, with an eye toward understanding some of the more advanced techniques. We will start by reviewing the most commonly used estimator: the ordinary least squares (OLS) estimator. Then we will explore some limitations of the OLS estimator when the residuals are not i.i.d. and discuss how to overcome these limitations, first with weighted least squares and then with generalized least squares. We'll close by discussing linear models in the context of genome-wide association studies (GWAS) as a lead-in to the talk.
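The step from ordinary to weighted least squares can be sketched for the simplest case, one predictor plus an intercept. The function name and the toy data are assumptions for illustration. Generalized least squares, which also handles correlated residuals through a full covariance matrix, is not shown.

```python
def weighted_least_squares(x, y, w=None):
    """Fit y ~ a + b*x by minimizing sum_i w_i * (y_i - a - b*x_i)^2.

    With w=None every weight is 1 and this is ordinary least squares.
    For residuals with unequal variances, w_i = 1 / variance_i is the
    usual choice.
    """
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Toy data (made up): y = 1 + 2x, except the last point, which we pretend
# was measured with very large variance and therefore gets a small weight.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 19.0]
weights = [1.0, 1.0, 1.0, 1.0, 0.01]

ols_intercept, ols_slope = weighted_least_squares(x, y)
wls_intercept, wls_slope = weighted_least_squares(x, y, weights)
```

Here the unweighted fit is pulled toward the noisy point, while the weighted fit stays close to the slope of 2 that generated the other four points.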
Abstract: f-statistics are now a commonplace tool in population genetics, used to discover and test models for demographic history. We describe the theory and ADMIXTOOLS, a package that implements most of the tests used. We give a number of examples of discoveries about human history made using f-statistics and conclude with some things we would like to do better and some open questions. [paper]
Learning developmental landscapes from single-cell gene expression with optimal transport
Abstract: Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time points over 16 days during reprogramming of fibroblasts to iPSCs. We construct a high-resolution map of reprogramming that rediscovers known features; uncovers new alternative cell fates including neural- and placental-like cells; predicts the origin and fate of any cell class; highlights senescent-like cells that may support reprogramming through paracrine signaling; and implicates regulatory models in particular trajectories. Of these findings, we highlight Obox6, which we experimentally show enhances reprogramming efficiency. Our approach provides a general framework for investigating cellular differentiation. [paper]
Abstract: The optimal transport (OT) problem is often described as that of finding the most efficient way of moving a pile of dirt from one configuration to another. Once stated formally, OT provides extremely useful tools for comparing, interpolating and processing objects such as distributions of mass, probability measures, histograms or densities. This talk is an up-to-date tutorial on a selection of topics in OT. In the first part, I will give an intuitive description of OT, its behavior and basic properties. I will also explain a useful extension of the theory to deal with unnormalized distributions of mass. In the second part, I will introduce state-of-the-art numerical methods for solving OT related problems, namely scaling algorithms based on entropic regularization.
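The abstract names "scaling algorithms based on entropic regularization" without giving details. The classic example of such an algorithm is the Sinkhorn iteration, sketched below for two small histograms. The function name, the regularization strength `eps`, and the toy cost matrix are assumptions for illustration. Small `eps` values need log-domain arithmetic for numerical stability, which is omitted here.

```python
import math

def sinkhorn(a, b, cost, eps=1.0, iters=500):
    """Entropy-regularized OT between histograms a and b (each sums to 1).

    Returns the transport plan P with P[i][j] = u[i] * K[i][j] * v[j],
    where K = exp(-cost / eps); u and v are found by alternating rescaling.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Rescale rows so that the plan's row sums match a ...
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # ... then columns so that its column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem (made up): three bins on a line, squared-distance cost.
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[float((i - j) ** 2) for j in range(3)] for i in range(3)]

plan = sinkhorn(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

At convergence the row sums of `plan` reproduce `a` and the column sums reproduce `b`. Larger `eps` gives a smoother, more spread-out plan, and smaller `eps` approaches the unregularized optimum.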
Insight into the biology of common diseases using summary statistics of large genome-wide association studies
Abstract: Data from genome-wide association studies (GWAS) contain valuable information about the genetic basis of the disease. For most common diseases, obtaining insights from these data is difficult because the signal is very diffuse: there are likely thousands or tens of thousands of causal variants, each with a very small effect size on disease risk. Moreover, for many of the largest disease GWAS, no individual researcher has access to all of the genotype data; rather, the only data available are meta-analyzed marginal effect size estimates for each variant. I will describe a powerful approach to modeling these summary statistics that allows us, for example, to identify disease-relevant tissues and cell types, or to quantify the degree to which two traits have a common genetic basis. The approach, called LD score regression, is based on a commonly used model in genetics in which the effect size of each variant on the disease is random. The parameters of this model provide information about the disease such as whether regions of the genome active in a given tissue (e.g., liver) tend to be more associated with disease than regions of the genome active in a second tissue (e.g., brain). I will present results from an application of LD score regression to identify relevant tissues and cell types from several large GWAS, and from an application of LD score regression to identify pairs of phenotypes with shared genetic basis. [papers 1, 2, 3, 4]
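As a rough illustration of how a random-effects model turns summary statistics into an estimate, the published LD score regression relationship is E[chi2_j] = 1 + N * h2 * l_j / M (plus a confounding term absorbed by the intercept), where l_j is the LD score of variant j, N the sample size, M the number of variants, and h2 the SNP heritability. The sketch below fits that line by plain least squares on noise-free toy numbers. The function name and the data are hypothetical. The real method uses a weighted regression and block-jackknife standard errors, both omitted.

```python
def ldsc_heritability(chi2, ld_scores, n_samples, n_variants):
    """Estimate SNP heritability from chi2_j ~ intercept + slope * l_j.

    Under the model, slope = N * h2 / M, so h2 = slope * M / N.
    An intercept well above 1 would suggest confounding such as
    population stratification.
    """
    k = len(chi2)
    lbar = sum(ld_scores) / k
    cbar = sum(chi2) / k
    slope = (sum((l - lbar) * (c - cbar) for l, c in zip(ld_scores, chi2))
             / sum((l - lbar) ** 2 for l in ld_scores))
    intercept = cbar - slope * lbar
    h2 = slope * n_variants / n_samples
    return h2, intercept

# Noise-free toy data (made up) generated from the model with h2 = 0.4.
N, M, true_h2 = 50_000, 1_000_000, 0.4
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [1.0 + N * true_h2 * l / M for l in ld_scores]

h2_hat, intercept_hat = ldsc_heritability(chi2, ld_scores, N, M)
```

Only per-variant association statistics and LD scores enter the calculation, which is why the approach works when individual-level genotypes are unavailable.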
October 11, 2017
Sisi Sarkizova, Michael Rooney
Hacohen Lab, Broad / MGH; Neon Therapeutics
Improving the rules of endogenous antigen prediction to support personalized cancer vaccine development
Abstract: In the seminar, we will see how tumor-specific mutations (neo-antigens) can stimulate the immune recognition of cancer cells and be used as a therapeutic strategy. For such a strategy to be successful, we need to be able to predict which endogenous peptide antigens will be presented on the cell surface by polymorphic HLA class I gene variants. We will present analyses of our single HLA peptide data which allowed us to develop improved rules for endogenous peptide presentation based on the physicochemical properties of binding peptides, patterns of peptide cleavage and abundance of cognate transcripts. Incorporating these findings into neural network models improved prediction of endogenous peptide binding as compared to current predictive algorithms. We will end by reviewing very encouraging results from a tumor vaccine trial in melanoma patients.
Primer: Tumor immunity
Abstract: In the primer, we will introduce the basics of how the adaptive immune system recognizes diseased cells and see that immune responses rely on the ability of cytotoxic T cells to identify and eliminate cells that display disease-associated antigens bound to specific cell-surface receptors (the human leukocyte antigen (HLA) class I molecules). We will discuss how this mechanism extends to cancer, what are some strategies by which tumors evade immune detection, and what are the therapeutic interventions that can boost immune clearance of tumors.
Abstract: As data sets grow in dimensionality, making sense of the wealth of interactions they contain has become a daunting task, not just due to the sheer number of relationships but also because relationships come in different forms (e.g. linear, exponential, periodic, etc.) and strengths. If you do not already know what kinds of relationships might be interesting, how do you find the most important or unanticipated ones effectively and efficiently? This is commonly done by using a statistic to rank relationships in a data set and then manually examining the top of the resulting list. For such a strategy to succeed though, the statistic must give similar scores to equally noisy relationships of different types. In this talk we will formalize this property, called equitability, and show how it is related to a variety of traditional statistical concepts. We will then introduce the maximal information coefficient, a statistic that has state-of-the-art equitability in a wide range of settings, and discuss how its equitability translates to practical benefits in the search for dependence structure in high-dimensional data using examples from global health and the human gut microbiome.
Primer: Hypothesis testing and measures of dependence
Abstract: Searching for departures from statistical independence in data is a fundamental problem that has been formalized in a variety of ways. We will cover two frameworks in which this problem has historically been understood. The first is statistical and involves framing the search as a hypothesis test in a finite-sample setting. The second is probabilistic and involves defining functions of random variables that have useful properties in the large-sample limit. We will close with a discussion of common themes underlying measures of dependence arising from each of these paradigms.
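The finite-sample, hypothesis-testing view can be illustrated with a permutation test: shuffling one variable destroys any dependence, so the shuffled statistics sample the null distribution. This is a minimal sketch. The choice of Pearson correlation as the test statistic, the function names, and the toy data are assumptions, and any other measure of dependence could be passed in instead.

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, statistic=pearson_r, n_perm=2000, seed=0):
    """Two-sided permutation test of H0: x and y are independent."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any pairing between x and y
        if abs(statistic(x, y_perm)) >= observed:
            exceed += 1
    # The +1 terms count the observed ordering itself, so p is never exactly 0.
    return (exceed + 1) / (n_perm + 1)

# Toy data (made up): y tracks x closely, with a small alternating offset.
x = [float(i) for i in range(20)]
y = [xi + (-1) ** i for i, xi in enumerate(x)]

p_value = permutation_p_value(x, y)
```

The test is only as sensitive as its statistic: Pearson correlation has little power against, say, a symmetric periodic relationship, which is one motivation for the more general measures of dependence discussed in the talk.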
Message passing algorithms for cryo-EM and synchronization
Abstract: Cryo-electron microscopy is a promising imaging technique in structural biology, yielding a large number of very noisy images of a macromolecule in different, unknown rotations. The computational task of reconciling these images into a 3D model of the molecule has proven both mathematically rich and challenging, leading to a mathematical formulation of "synchronization" problems: the learning task of aligning rotated objects based on noisy measurements of their pairwise relative rotations. We present an algorithm following the framework of approximate message passing, which statistical physics suggests may yield the optimal efficient reconstruction. Our approach leverages the representation theory of compact groups to give a unified, general theory for problems with various conceptual 'rotations' or 'alignments'. (Joint work with Amelia Perry, Afonso Bandeira, and Ankur Moitra.)
Abstract: In order to achieve state-of-the-art performance, modern machine learning techniques require careful data pre-processing and hyperparameter tuning. Moreover, given the ever increasing number of machine learning models being developed, model selection is becoming increasingly important. Automating the selection and tuning of machine learning pipelines consisting of data pre-processing methods and machine learning models, has long been one of the goals of the machine learning community. In this talk, we tackle this meta-learning task by combining ideas from collaborative filtering and Bayesian optimization. Using probabilistic matrix factorization techniques and acquisition functions from Bayesian optimization, we exploit experiments performed in hundreds of different datasets to guide the exploration of the space of possible pipelines. In our experiments, we show that our approach quickly identifies high-performing pipelines across a wide range of datasets, significantly outperforming the current state-of-the-art. We also show that this approach can be more generally used to tune parameters of any system (not just machine learning ones) by exploiting information gathered from multiple related experiments.
Abstract: Effective guide design is a key part of CRISPR-Cas9 deployment. Although molecular biology is working to improve CRISPR-Cas9 and related systems, one can make the guide design process more effective by using machine learning. We will discuss our state-of-the-art machine-learning based guide design models for both on-target (Azimuth) and off-target (Elevation) prediction.
Center for Genome Architecture, Baylor Med, Rice CS / Applied Math
A 3D Code in the Human Genome
Abstract: Stretched out from end-to-end, the human genome – a sequence of 3 billion chemical letters inscribed in a molecule called DNA – is over 2 meters long. Famously, short stretches of DNA fold into a double helix, which winds around histone proteins to form the 10 nm fiber. But what about longer pieces? Does the genome’s fold influence function? How does the information contained in such an ultra-dense packing even remain accessible?
In this talk, I describe our work developing ‘Hi-C’ (Lieberman-Aiden et al., Science, 2009; Aiden, Science, 2011) and more recently ‘in-situ Hi-C’ (Rao & Huntley et al., Cell, 2014), which use proximity ligation to transform pairs of physically adjacent DNA loci into chimeric DNA sequences. Sequencing a library of such chimeras makes it possible to create genome-wide maps of physical contacts between pairs of loci, revealing features of genome folding in 3D.
Next, I will describe recent work using in situ Hi-C to construct haploid and diploid maps of nine cell types. The densest, in human lymphoblastoid cells, contains 4.9 billion contacts, achieving 1 kb resolution. We find that genomes are partitioned into contact domains (median length, 185 kb), which are associated with distinct patterns of histone marks and segregate into six subcompartments. We identify ∼10,000 loops. These loops frequently link promoters and enhancers, correlate with gene activation, and show conservation across cell types and species. Loop anchors typically occur at domain boundaries and bind the protein CTCF. The CTCF motifs at loop anchors occur predominantly (>90%) in a convergent orientation, with the asymmetric motifs “facing” one another.
Next, I will discuss the biophysical mechanism that underlies chromatin looping. Specifically, our data is consistent with the formation of loops by extrusion (Sanborn & Rao et al., PNAS, 2015). In fact, in many cases, the local structure of Hi-C maps may be predicted in silico based on patterns of CTCF binding and an extrusion-based model.
Finally, I will show that by modifying CTCF motifs using CRISPR, we can reliably add, move, and delete loops and domains. Thus, it is possible not only to “read” the genome’s 3D architecture, but also to write it.
Primer: Introduction to Hi-C
Abstract: Hi-C is an assay which measures the frequency by which any two loci in the genome are in physical contact in the nucleus of a cell. We will begin by reviewing the Hi-C experimental procedure and discussing data normalization techniques. Then we will describe some common techniques to analyze Hi-C data. We will conclude by discussing some modifications/alternatives to Hi-C.
From genome to networks: a data-driven, tissue-specific view of human disease
Abstract: Identifying functional effects of noncoding variants is a major challenge in human genetics. I will discuss our deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/) that predicts noncoding-variant effects de novo from genomic sequence. DeepSEA directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants and to predict tissue-specific expression based only on genomic sequence.
I will then discuss our work on building tissue-specific networks (http://hb.flatironinstitute.org/) to understand cell- and tissue-specific gene function and regulation and application of these networks to the study of autism spectrum disorder (ASD). ASD is a complex neurodevelopmental disorder with a strong genetic basis. Yet, only a small fraction of potentially causal genes—about 65 genes out of an estimated several hundred—are known with strong genetic evidence from sequencing studies. We developed a complementary machine-learning approach based on a human brain-specific gene network to present a genome-wide prediction of autism risk genes, including hundreds of candidates for which there is minimal or no prior genetic evidence.
Primer: Integrated, tissue-specific analysis of biological data
Abstract: The increasingly commonplace generation of genome-scale data provides us with a wealth of biological knowledge that captures global molecular-level changes in diverse model organisms and humans. However, these large data are often noisy, highly heterogeneous, and lack the resolution required to study key aspects of metazoan complexity, such as tissue and cell-type specificity. In this primer, we will discuss a semi-supervised Bayesian network integration approach that leverages such large data compendia in concert with biological knowledge derived from small scale experiments to predict functional relationships between genes. We will then explore some of the applications of these models of tissue and cell function, including the prioritization of novel disease candidate genes based on genome-wide association studies (GWAS). Finally, we will demo publicly available web servers that provide interfaces to many of the analyses described here.
In search of lost time: reconstructing the evolutionary history of cancer genomes
Abstract: Chromosomal abnormalities are a hallmark feature of cancer genomes. In contrast to point mutations that mostly accumulate gradually during tumor development, chromosomal rearrangements are often generated episodically and manifest as clusters (or complex rearrangements). Such complexity makes it extremely difficult to identify patterns of chromosomal rearrangements or infer the history of rearrangement accumulation. In this talk, I will discuss our recent progress towards solving this problem based on three ideas. The first is to combine haplotype phasing and allelic copy-number analysis to determine the DNA copy number of each parental homolog. The second is to combine haplotype copy number and discordant read pairs to construct the sequence of rearranged chromosomes. Finally, we use knowledge from in vitro cell biology experiments to recognize unique rearrangement patterns. I will also discuss strategies to infer the timing of mutational events.
Spring 2017 Schedule: 8:30am Primer, 9:20am Breakfast, 9:30am Seminar, 10:30am Discussion; all in Monadnock
Department of Systems Biology, Harvard Medical School
Structure and fitness from genomic sequences
Abstract: The evolutionary trajectories of biological sequences are propelled by mutation and whittled away by selection to maintain and develop function. Present day sequences can therefore be regarded as the outcomes of millions of evolutionary experiments that record functional constraints in the genotype-phenotype map. In this talk I will first recap the primer by John and Adam that describes how a generative model for sequences can quantify evolutionary constraints on biomolecules in terms of couplings between specific residue combinations. I will show how we have applied this model to predict (i) accurate 3D structures of proteins, RNA and complexes, (ii) conformational plasticity of ‘disordered’ proteins, (iii) quantitative effects of mutations on organism fitness, and (iv) designed sequences of proteins with desired properties. These computational approaches address the challenge of inferring causality from correlations in genetic sequences but can be applied more widely to other biological information such as gene expression or dynamics, cellular phenotypes or drug response. I will introduce challenges and opportunities for extending these methods to diverse biomedical and engineering applications.
Marks Lab, Department of Systems Biology, Harvard Medical School
Primer: Generative models of biological sequence families
Abstract: Modern genome sequencing and synthesis can acquire and generate tremendous molecular diversity in a day, but our ability to navigate and interpret the exponentially large space of potential biological sequences remains limited. Central to this challenge is the lack of a priori knowledge about epistasis, i.e. non-additive interactions between positions in a molecule or genome. We will describe a class of generative models, discrete undirected graphical models, that, when fit to deep evolutionary sequence variation, can reveal both the three dimensional structures and mutational landscapes of proteins and RNAs, described in more detail in the talk by Debora after the break. In this primer, we will review the math and intuition behind these models, how they require approximate methods for scalable inference, and connections to other common methods in quantitative biology such as partial correlations and logistic regression. Lastly, we will outline how to go beyond pairwise and detect higher order epistasis with neural-network-powered generative models.
Gone Fishing: Unsupervised methods for discovery from public data
Abstract: Public gene expression data are abundant. Anybody with an internet connection can download more than 2 million genome-wide assays of gene expression. Learning from these data remains challenging. For example, public data often lack the annotations that enable traditional meta-analysis. If we could surmount these barriers, however, we'd have a valuable resource at our fingertips. Our lab uses machine learning methods to integrate these heterogeneous, noisy, and often poorly or incorrectly annotated data. We focus specifically on algorithms that are unsupervised and robust to noise in order to tackle unannotated data. We've shown that these algorithms can robustly reveal biological features in data from cancer biopsies to microbial systems. And we share these algorithms by building user-friendly software and web servers. Our aim is to make the reproducible analysis of big public data as routine in life sciences labs as wet-bench techniques like PCR.
Primer: Integrating biomedical knowledge to predict new uses for existing drugs
Abstract: How do you teach a computer biology? Our goal was to predict new uses for existing drugs. But we're data scientists, not pharmacologists. So we set out to encode the knowledge from millions of biomedical studies from the last half century. Using a heterogeneous network (hetnet) as our data structure, we were able to condense a large portion of biomedical knowledge into a network with 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. The network is named Hetionet v1.0 and lives at https://neo4j.het.io.
Hetionet enables queries that span many types of information. While such queries were possible before Hetionet, they often took months of data integration, preprocessing, and specialized query scripts. Now complex queries can be written in minutes using the Cypher query language for hetnets. Accordingly, we were able to perform ~47 million queries to assess the connectivity between 136 diseases and 1,538 compounds. Next, we compiled a catalog of 755 disease-modifying treatments and learned which types of network paths could predict whether a compound treats a disease. In total, we predicted probabilities of treatment for 209,168 compound-disease pairs (http://het.io/repurpose). Our method also allows you to compare which types of information were valuable for predicting drug efficacy. Project Rephetio, the codename for this project, was performed openly online in realtime (https://doi.org/bszr). In total, 40 community members provided feedback across 86 project discussions.
Attend the primer to learn more about Project Rephetio & Hetionet as well as hetnets for data integration and the Neo4j graph database. Research continues as a set of open source GitHub repositories, allowing anyone interested to get involved.
Deep learning chemical space: a variational autoencoder for automatic molecular design
Abstract: Virtual screening is increasingly proven as a tool to test new molecules for a given application. Through simulation and regression we can gauge whether a molecule will be a promising candidate in an automatic and robust way. A large remaining challenge, however, is how to perform optimizations over a discrete space of size at least 10^60. Despite the size of chemical space, or perhaps precisely because of it, coming up with novel, stable, makeable molecules that are effective is not trivial. First-principles approaches to generating new molecules fail to capture the intuition embedded in the ~100 million existing molecules. I will report our progress towards developing an autoencoder that allows us to project molecular space into a continuous, differentiable representation where we can perform molecular optimization.
March 8, 2017
Carl de Boer
Regev Lab, Broad Institute
Learning the rules of gene regulation with millions of synthetic promoters
Abstract: Gene regulatory programs are encoded in the sequence of the DNA. However, how the cell uses transcription factors (TFs) to interpret regulatory sequence remains incompletely known. Synthetic regulatory sequences can provide insight into this logic by providing additional examples of sequences and their regulatory output in a controlled setting. Here, we have measured the gene expression output of tens of millions of unique promoter sequences, whose expressions span a range of 1000-fold, in a controlled reporter construct. This vast dataset of expression-DNA pairs represents a unique machine learning opportunity, and we use it to build quantitative models of transcriptional regulation based on biochemical principles. Even with a naive “billboard” model of gene regulation (with no positioning or complex TF-interactions), we can explain upwards of 92% of the variation in expression. We gain numerous insights into gene regulation, including a quantitative description of activation, repression, and chromatin modification for each TF, consistent with known TF activities and condition-specific regulators, and even use our data to refine the specificities of TFs. Although a “billboard” model explains the majority of expression in our system, certain TFs show position-, orientation-, and even DNA helical-face-dependent activities. We have so many promoter examples that we can look for potential spacing/orientation-dependent interactions between most TF pairs at base pair resolution, and find certain interactions consistent with biochemical cooperativity. Altogether, the principles learned here help us to better understand when and where TFs bind DNA, what they do when they get there, and how regulatory sequences evolve.
March 15, 2017
Data Sciences & Data Engineering, Broad
A scalable Bayesian framework for inferring copy number variation
Abstract: Inferring copy number variation (CNV) from next-generation sequencing (NGS) data is a challenging problem. On the one hand, the complexity of the NGS technology results in a highly non-uniform sampling of the genome with unknown latent factors. On the other hand, devising and implementing modern machine learning algorithms for CNV inference in a scalable and robust fashion is an arduous task due to the sheer size of the data. In this talk, we briefly review the existing approaches and glance over a number of their caveats, including difficulty with sex chromosomes, lack of a data-driven model for determining the number of bias latent factors, neglect of sampling noise, heuristic filtering and outlier detection, and lack of self-consistency and scalability. Next, we introduce GATK gCNV, our principled and scalable Bayesian framework for germline CNV inference from whole-exome sequencing (WES) and whole-genome sequencing (WGS) data that addresses these caveats. We benchmark GATK gCNV, XHMM and CODEX on WES data against high-confidence Genome STRiP calls on matched WGS data as ground truth, and show that GATK gCNV yields up to 30 percent higher sensitivity and specificity compared to the existing tools. We conclude the talk with a brief discussion of our ongoing efforts toward addressing the difficulty with common and large CNV events, and generalization to somatic CNV inference.
Data Sciences & Data Engineering, Broad
Primer: Bayesian PCA
Abstract: The model at the heart of GATK gCNV builds heavily on the probabilistic and Bayesian approaches to principal component analysis (PCA). In contrast with traditional PCA, the probabilistic approach provides a predictive model that can account for missing data, while the fully Bayesian approach further enables a principled way to learn the effective dimensionality of the principal subspace (i.e., the appropriate number of principal components to use). Model inference can be performed using expectation-maximization and variational-Bayesian methods, respectively. We will give a pedagogical overview of these methods, drawing analogies between Bayesian PCA and the perhaps more familiar Gaussian mixture model.
Probabilistic models of diversity: applications and algorithms for determinantal point processes
Abstract: Determinantal Point Processes (DPPs) are gaining popularity in machine learning as elegant probabilistic models of diversity. In other words, these are probability distributions over subsets of a collection of items (data points, features, ...) that prefer diverse subsets. In particular, many computations that are difficult with other models "simply" reduce to linear algebra for DPPs. DPPs have been known to arise in statistical physics, combinatorial probability and random matrix theory, and certain approximation algorithms. The first part of this talk will survey machine learning-related applications of DPPs, from recommendation, feature selection and improving interpretability to matrix approximations for kernel methods and pruning of neural networks.
Despite their ease of modeling, the wide applicability of DPPs has been hindered by computationally expensive sampling algorithms. The second part of the talk will address recent progress in sampling algorithms for DPPs and its implications in theory and practice. Most of the talk will be tutorial-style and does not require any prior knowledge of DPPs.
Based on joint work with Chengtao Li and Suvrit Sra.
Abstract: The primer will be a short tutorial that introduces Determinantal Point Processes with a bit of detail and intuition, and explains their relation to diversity, basic computations, and important models.
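As a rough illustration of how DPP computations reduce to linear algebra, the sketch below evaluates subset probabilities for an L-ensemble DPP, where P(Y) = det(L_Y) / det(L + I). The 3-item kernel is hypothetical, and the determinant routine is a naive expansion that is only suitable for tiny matrices.

```python
from itertools import combinations

def det(m):
    """Determinant by Laplace expansion along the first row (tiny matrices only)."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix, used for the empty subset
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def submatrix(L, subset):
    """Rows and columns of L indexed by the chosen subset."""
    return [[L[i][j] for j in subset] for i in subset]

def dpp_probability(L, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(submatrix(L, subset)) / det(L_plus_I)

# Hypothetical similarity kernel: items 0 and 1 are near-duplicates,
# item 2 is unrelated to both.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

p_similar = dpp_probability(L, (0, 1))  # near-duplicate pair
p_diverse = dpp_probability(L, (0, 2))  # dissimilar pair
# Probabilities over all 2^3 subsets should sum to one.
total = sum(dpp_probability(L, s)
            for k in range(4) for s in combinations(range(3), k))
```

In this toy kernel the dissimilar pair receives a higher probability than the near-duplicate pair, which is the sense in which a DPP prefers diverse subsets.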
Abstract: For centuries, our understanding of planetary systems has been based on observations of a unique sample, the Solar System. Similarly, our perspective on Life and habitats has remained Earth-centric, leaving millennia-old questions such as "Are we alone? Where/How/When did Life emerge?" unanswered. Two decades ago, the first planet orbiting a star other than ours (a.k.a. an exoplanet) was discovered, opening a new chapter of space exploration. Since then, over 3,500 exoplanets have been found in over 2,500 other systems, a sample size increase of three orders of magnitude that has already yielded profound changes in our understanding of planetary systems. Similar changes await our perspective on Life and habitats within the next generation. During this talk, a “Searching for New Worlds 101” will be provided to introduce the TRAPPIST-1 system, exploring our recent discovery of Earth-sized planets that are both potentially habitable and amenable to in-depth studies with upcoming observatories, and the first insights into their atmospheres, as revealed by the Hubble Space Telescope.
At the other end of the scale, biology focuses on chemical processes within cells rather than within atmospheres. A fundamental—and yet mostly overlooked—set of cellular processes gravitates around transient calcium signals. The availability of fast fluorescent calcium indicators allows for the measurements of intracellular calcium and thus provides direct observables of pathological and physiological calcium fluctuations. Calcium signals thereby offer new perspectives to approach a variety of diseases, from diabetes and metabolic disease to Alzheimer's disease.
Interestingly, these seemingly diverse fields of biology and planetary sciences share a common cornerstone: (Spectro)Photometric time series. With the arrival of high throughput facilities (e.g. TESS for exoplanetary sciences; FLIPR for biology), the need for standardized data acquisition/processing tools has emerged. The inherent similarity between these fields, in terms of multidisciplinarity and datatype, allows for mutually-beneficial collaborations that need to be leveraged to support the optimal sampling of yet unexplored parameter spaces, and their unbiased interpretation.
Grand Challenge: Mapping the regulatory wiring of the genome
Abstract: Our cells are controlled by complex molecular instructions encoded in the "noncoding" sequences of our genome, and alterations to these noncoding sequences underlie many common human diseases. The grammar of these noncoding sequences has been difficult to study, but the recent confluence of methods for both high-throughput measurement and high-throughput perturbation offers new opportunities to understand these sequences at a systems level. In this talk, I will highlight outstanding challenges in gene regulation where applying computational approaches in combination with emerging genomics datasets may allow us to build integrated maps that describe the regulatory wiring of the genome. As an example, I will present our efforts to experimentally and computationally map the functional connections between promoters and distal enhancers and use this information to understand human genetic variation in the noncoding genome.
Composing graphical models with neural networks for structured representations and fast inference
Abstract: I'll describe a new modeling and inference framework that combines the flexibility of deep learning with the structured representations of probabilistic graphical models. The model family augments latent graphical model structure, like switching linear dynamical systems, with neural network observation likelihoods. To enable fast inference, we show how to leverage graph-structured approximating distributions and, building on variational autoencoders, fit recognition networks that learn to approximate difficult graph potentials with conjugate ones. I'll show how these methods can be applied to learn how to parse mouse behavior from depth video.
Primer: Bayesian time series modeling with recurrent switching linear dynamical systems
Abstract: Many natural systems like neurons firing in the brain or basketball teams traversing a court give rise to time series data with complex, nonlinear dynamics. We gain insight into these systems by decomposing the data into segments that are each explained by simpler dynamical units. Bayesian time series models provide a flexible framework for accomplishing this task. This primer will start with the basics, introducing linear dynamical systems and their switching variants. With this background in place, I will introduce a new model class called recurrent switching linear dynamical systems (rSLDS), which discover distinct dynamical units as well as the input- and state-dependent manner in which units transition from one to another. In practice, this leads to models that generate much more realistic data than standard SLDS. Our key innovation is to design these recurrent SLDS models to enable recent Pólya-gamma auxiliary variable techniques and thus make approximate Bayesian learning and inference in these models easy, fast, and scalable.
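To make the idea of state-dependent switching concrete, here is a toy generative sketch with two discrete states and a one-dimensional continuous state. All parameter values are hypothetical, and the logistic transition rule is a simplified stand-in for the transition model of the actual rSLDS; no inference is performed here.

```python
import math
import random

def simulate_toy_rslds(T=200, seed=0):
    """Simulate a 1-D, two-state switching linear system with recurrent transitions."""
    rng = random.Random(seed)
    # Hypothetical per-state linear dynamics: x_{t+1} = a[z] * x_t + b[z] + noise
    a = [0.95, 0.95]
    b = [0.5, -0.5]   # state 0 pushes x upward, state 1 pushes x downward
    noise_sd = 0.1
    w = 2.0           # hypothetical weight: P(z_{t+1} = 1 | x_t) = sigmoid(w * x_t)

    x, z = 0.0, 0
    xs, zs = [], []
    for _ in range(T):
        xs.append(x)
        zs.append(z)
        # Recurrent step: the next discrete state depends on the continuous state.
        p_one = 1.0 / (1.0 + math.exp(-w * x))
        z = 1 if rng.random() < p_one else 0
        # Linear step: the continuous state evolves under the chosen dynamics.
        x = a[z] * x + b[z] + rng.gauss(0.0, noise_sd)
    return xs, zs

xs, zs = simulate_toy_rslds()
```

In a standard SLDS the discrete state would follow a Markov chain that ignores x; the only change in this sketch is that the transition probability is a function of the current continuous state.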
Simulating, storing and processing genetic variation data for millions of samples
Abstract: Coalescent theory has played a key role in modern population genetics and is fundamental to our understanding of genetic variation. While simulation has been essential to coalescent theory from its beginnings, simulating realistic population-scale genome-wide data sets under the exact model was, until recently, considered infeasible. Even under an approximate model, simulating more than a few tens of thousands of samples was very time consuming and could take several weeks to complete a single replicate. However, by encoding simulated genealogies using a new data structure (called a tree sequence), we can now simulate entire chromosomes for millions of samples under the exact coalescent model in a few hours. We discuss some applications that these simulations have made possible, including a study of biases in human GWAS and the systematic benchmarking of variant processing tools at scale. The tree sequence data structure is also an extremely concise way of representing genetic variation data, and we show how variant data for millions of simulated human samples can be stored in only a few gigabytes. Moreover, we show that this very high level of compression does not incur a decompression cost. Because the information is represented in terms of the underlying genealogies, operations such as computing allele frequencies on sample subsets or measuring linkage disequilibrium can be made very efficient. Finally, we discuss ongoing work on inferring tree sequences from observed data and present some preliminary results.
Department of Biomedical Informatics, Harvard Medical School
From one to millions of cells: computational challenges in single-cell analysis
Abstract: Over the last five years, our ability to isolate and analyze detailed molecular features of individual cells has expanded greatly. In particular, the number of cells measured by single-cell RNA-seq (scRNA-seq) experiments has gone from dozens to over a million cells, thanks to improved protocols and fluidic handling. Analysis of such data can provide detailed information on the composition of heterogeneous biological samples, and on the variety of cellular processes that altogether comprise the cellular state. Such inferences, however, require careful statistical treatment, to take into account measurement noise as well as inherent biological stochasticity. I will discuss several approaches we have developed to address such problems, including error modeling techniques, statistical interrogation of heterogeneity using gene sets, and visualization of complex heterogeneity patterns, implemented in the PAGODA package. I will discuss how these approaches have been modified to enable fast analysis of very large datasets in PAGODA2, and how the flow of typical scRNA-seq analysis can be adapted to take advantage of potentially extensive repositories of scRNA-seq measurements. Finally, I will illustrate how such approaches can be used to study transcriptional and epigenetic heterogeneity in human brains.
Primer: Linking genetic and transcriptional intratumoral heterogeneity at the single cell level
May 10, 2017
Broad Fellow, Chemical Biology & Therapeutic Sciences
Continuous directed evolution: advances, applications, and opportunities
Abstract: The development and application of methods for the laboratory evolution of biomolecules has rapidly progressed over the last few decades. Advancements in continuous microbe culturing and selection design have facilitated the development of new technologies that enable the continuous directed evolution of proteins and nucleic acids. These technologies have the potential to support the extremely rapid evolution of biomolecules with tailor-made functional properties. Continuous evolution methods must support all of the key steps of laboratory evolution — translation of genes into gene products, selection or screening, replication of genes encoding the most fit gene products, and mutation of surviving genes — in a self-sustaining manner that requires little or no researcher intervention. In this presentation, I will describe the basis and applications of our Phage-Assisted Continuous Evolution (PACE) platform, solutions we have devised to address known limitations in the technique, and opportunities to improve PACE where in silico computation may play a key role. Through these tools, we aspire to enable researchers to address increasingly complex biological questions and to access biomolecules with novel or even unprecedented properties.
Edge-exchangeable graphs, clustering, and sparsity
Abstract: Many popular network models rely on the assumption of (vertex) exchangeability, in which the distribution of the graph is invariant to relabelings of the vertices. However, the Aldous-Hoover theorem guarantees that these graphs are dense or empty with probability one, whereas many real-world graphs are sparse. We present an alternative notion of exchangeability for random graphs, which we call edge exchangeability, in which the distribution of a graph sequence is invariant to the order of the edges. We demonstrate that a wide range of edge-exchangeable models, unlike any models that are traditionally vertex-exchangeable, can exhibit sparsity. To develop characterization theorems for edge-exchangeable graphs analogous to the powerful Aldous-Hoover theorem for vertex-exchangeable graphs, we turn to a seemingly different combinatorial problem: clustering. Clustering involves placing entities into mutually exclusive categories. A "feature allocation" relaxes the requirement of mutual exclusivity and allows entities to belong simultaneously to multiple categories. In the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via "exchangeable partition probability functions" and the "Kingman paintbox"). These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We show how these characterizations can be extended to feature allocations and, from there, to edge-exchangeable graphs.
Primer: Nonparametric Bayesian Models, methods, and applications
Abstract: Nonparametric Bayesian methods make use of infinite-dimensional mathematical structures to allow the practitioner to learn more from their data as the size of their data set grows. What does that mean, and how does it work in practice? In this tutorial, we'll cover why machine learning and statistics need more than just parametric Bayesian inference. We'll introduce such foundational nonparametric Bayesian models as the Dirichlet process and Chinese restaurant process and touch on the wide variety of models available in nonparametric Bayes. Along the way, we'll see what exactly nonparametric Bayesian methods are and what they accomplish.
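As a small illustration of how the number of clusters can grow with the data, the sketch below simulates the Chinese restaurant process: each new customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The function name and the value of `alpha` are illustrative choices, not taken from the tutorial.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time; return table assignments and table sizes."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []
    for n in range(n_customers):      # n customers are already seated
        # Existing table k is chosen with probability size_k / (n + alpha),
        # a new table with probability alpha / (n + alpha).
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        chosen = len(table_sizes)     # default: open a new table
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                chosen = k
                break
        if chosen == len(table_sizes):
            table_sizes.append(0)
        table_sizes[chosen] += 1
        assignments.append(chosen)
    return assignments, table_sizes

assignments, table_sizes = chinese_restaurant_process(n_customers=500, alpha=2.0)
```

Running this with increasing `n_customers` shows the number of occupied tables growing slowly with the sample size, rather than being fixed in advance.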
Reconstructing trajectories and branching lineages in single cell genomics
Abstract: Single-cell technologies have gained popularity in developmental biology because they allow resolving potential heterogeneities due to asynchronicity of differentiating cells. Common data analysis encompasses normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we may not expect clear clusters to be present - instead cells tend to follow continuous branching lineages.
In this talk I will first review methods for pseudotime ordering of cells according to their single cell profiles, which are used for reconstructing such trajectories. Then I will show that modeling the high-dimensional state space as a diffusion process, where cells move to close-by cells with a distance-dependent probability, reflects the differentiating characteristics well. Based on the underlying diffusion map transition kernel, cells can be ordered according to a diffusion pseudotime (DPT), which allows for a robust identification of branching decisions and corresponding trajectories of single cells. After application to blood stem cell differentiation, I finish with current extensions towards single cell RNAseq time series and population models as well as driver-gene identification.
Center for Science and the Imagination, Arizona State University
What Algorithms Want
Abstract: We depend on — we believe in — algorithms to help us get a ride, choose which book to buy, execute a mathematical proof. It is as if we think of code as a magic spell, an incantation to reveal what we need to know and even what we want. But how do we navigate the gap between what algorithms really do and all the things we think, and hope, they do? This talk explores the evolving figure of the algorithm as it bridges the idealized space of computation and messy reality, with unpredictable and sometimes fascinating results. Drawing on sources that range from Neal Stephenson’s “Snow Crash” to Diderot’s “Encyclopédie,” from Adam Smith to the “Star Trek” computer, Finn explores the gap between theoretical ideas and pragmatic instructions, and the consequences of that gap for research at the intersection of computation and culture.
Fall 2016 Schedule: 8:30am Primer, 9:20am Breakfast, 9:30am Seminar, 10:30am Discussion; all in Monadnock
Composite measurements and molecular compressed sensing for efficient transcriptomics at scale
Abstract: Comprehensive RNA profiling provides an excellent phenotype of cellular responses and tissue states, but can be prohibitively expensive to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. However, because expression profiles may reflect a limited number of degrees of freedom, a smaller number of measurements might suffice to capture most of the information. Here, we use existing mathematical guarantees to demonstrate that gene expression information can be preserved in a random low dimensional space. We propose that samples can be directly observed in low dimension through a fundamentally new type of measurement that distributes a single readout across many genes. We show by simulation that as few as 100 of these randomly composed measurements are needed to accurately estimate the global similarity between any pair of samples. Furthermore, we show that methods of compressive sensing can be used to recover gene abundances from drastically under-sampled measurements, even in the absence of any prior knowledge of gene-to-gene correlations. Finally, we propose an experimental scheme for such composite measurements. Thus, compressive sensing and composite measurements can become the basis of a massive scale up in the number of samples that can be profiled, opening new opportunities in the study of single cells, complex tissues, perturbation screens and expression-based diagnostics.
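The claim that a small number of random composite measurements can preserve sample-to-sample similarity can be sketched numerically. The example below assumes Gaussian random weights in the style of a Johnson-Lindenstrauss random projection, which is an assumption about the kind of guarantee meant; the expression profiles are synthetic and the gene count is hypothetical.

```python
import math
import random

def composite_measurements(profile, weights):
    """Each measurement is one weighted sum across all genes (y = A x)."""
    return [sum(w * x for w, x in zip(row, profile)) for row in weights]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
n_genes, n_measurements = 2000, 100

# Synthetic (hypothetical) expression profiles for two samples.
sample_a = [rng.random() for _ in range(n_genes)]
sample_b = [rng.random() for _ in range(n_genes)]

# Gaussian weights scaled by 1/sqrt(m) so squared distances are preserved
# in expectation.
scale = 1.0 / math.sqrt(n_measurements)
A = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_genes)]
     for _ in range(n_measurements)]

d_full = distance(sample_a, sample_b)
d_low = distance(composite_measurements(sample_a, A),
                 composite_measurements(sample_b, A))
ratio = d_low / d_full   # close to 1 when the distance is well preserved
```

Recovering individual gene abundances from such measurements is a separate step that relies on compressive sensing and is not shown here.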
Automated Inference and the Promise of Probabilistic Programming
Abstract: Generative probability models allow us to 1) express assumptions about hidden patterns in data, 2) infer such hidden patterns, and 3) evaluate the accuracy of our findings.
However, designing modern models, developing custom inference algorithms, and evaluating accuracy requires enormous effort and cross-disciplinary expertise. Probabilistic programming promises to enable this process by making each step less arduous and more automated.
I will begin by describing how probabilistic programming can help design modern probability models. I will then focus on automating inference for a wide class of probability models. To this end, I will describe automatic differentiation variational inference, a fully automated approximate inference algorithm. I will demonstrate its application to a mixture modeling analysis of a dataset with millions of observations. I intend to conclude with some thoughts on model evaluation, with a population genetics example.
Throughout this talk, I will highlight connections to our software project, Edward: a Python library for probabilistic modeling, inference, and evaluation.
Primer: Probabilistic Generative Models and Posterior Inference
Abstract: To model data we desire to express assumptions about the data, infer hidden structure, make predictions, and simulate new data. In this talk, I will describe how probabilistic generative models provide a common toolkit to meet these challenges. I will first present these ideas in a toy setting followed by discussing the range of probabilistic generative models from structural to algorithmic. Next I will present an in depth view of deep exponential families, a class of probability models containing both predictive and interpretive models. I will end with the central computational problem in realizing the promise of probabilistic generative models: posterior inference. I will demonstrate why deriving inference is tedious and will touch on black box variational methods which seek to alleviate this burden.
Overcoming Bias and Batch Effects in High-Throughput Data
Abstract: The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. In the life sciences, data analysis is now part of practically every research project. Genomics, in particular, is being driven by new measurement technologies that permit us to observe certain molecular entities for the first time. These observations are leading to discoveries analogous to identifying microorganisms and other breakthroughs permitted by the invention of the microscope. Examples of this are the many applications of next-generation sequencing.
Biases, systematic errors and unexpected variability are common in biological data. Failure to discover these problems often leads to flawed analyses and false discoveries. As datasets become larger, the potential of these biases to appear to be significant actually increases. In this talk I will describe several of these challenges using very specific examples from gene expression microarrays, RNA-seq, and single-cell assays. I will describe data science solutions to these problems.
Primer: Experimental and computational techniques underlying RNA-seq
Abstract: We will provide an overview of the experimental and computational steps involved in RNA-seq for both bulk and single-cell experiments. We will begin with a brief review of Illumina short-read sequencing by synthesis; continue to describing the molecular biology used in preparing RNA-seq libraries; and discuss quality trimming, read alignment, transcript quantification and normalization of gene expression measures. We will conclude with a discussion of techniques commonly leveraged in single-cell RNA-Seq: linear pre-amplification, unique molecular identifiers (UMI/RMTs) and 3’-barcode counting. Throughout the primer, we will mention potential sources of bias that can be introduced at each step and why they occur.
Abstract: The usual framework for TDA takes as its starting point that a data set is sampled (noisily) from a manifold embedded in a high dimensional space, and provides a reconstruction of topological features of that manifold. However, the underlying algebraic topology can be applied to data in a much broader sense, carries much richer information about the system than just the barcodes, and can be fine-tuned so it sees only features of the data we want it to see. I will discuss this framework broadly, with a focus on a few of these alternative viewpoints, including applications to neuroscience and matrix factorization.
Abstract: A fundamental question in big data analysis is whether and how data points may be sampled, noisily, from an intrinsically low-dimensional geometric shape, called a manifold, embedded in a high dimensional “sensor” space. Topological data analysis (TDA) aims to measure the “intrinsic shape” of data and identify this manifold despite noise and the likely nonlinear embedding. I will discuss the basics of the fundamental tool in TDA called persistent homology, which assigns to a point cloud a count of topological features (roughly, “holes” of various dimensions), with a measure of importance of each feature recorded in a “barcode” of the data to help distinguish the significant features from the noise.
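The simplest case of a barcode, for connected components (dimension 0), can be sketched with a union-find structure. Every point is born as its own component at scale 0, and a component dies when it merges with another as the distance threshold grows. The points are hypothetical, and recording the death at the full pairwise distance (rather than half of it) is one of several conventions.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Return (birth, death) bars for connected components of a point cloud."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process pairwise distances from shortest to longest.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))   # one component dies at this scale
    bars.append((0.0, math.inf))         # the final component never dies
    return bars

# Hypothetical point cloud: a tight group of three points and a distant pair.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
bars = h0_barcode(points)
```

For this point cloud, three bars end at scale 1 and one bar persists until scale 9, the gap between the two groups; the long bar is the significant feature and the short bars are the noise. Higher-dimensional holes require more machinery than shown here.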
October 26, 2016
Harvard Medical School
FIDDLE: An integrative deep learning framework for functional genomic data inference
Abstract: Numerous advances in sequencing technologies have revolutionized genomics through generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but strategies to integrate disparate datasets under a unified framework are lacking. Moreover, most analysis techniques heavily rely on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that a type of data can be inferred from other sources of data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus, FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets.
Primer: Automatic differentiation, the algorithm behind all deep neural networks
Abstract: A painful and error-prone step of working with gradient-based models (deep neural networks being one kind) is actually deriving the gradient updates. Deep learning frameworks, like Torch, TensorFlow and Theano, have made this a great deal easier for a limited set of models — these frameworks save the user from doing any significant calculus by instead forcing the framework developers to do all of it. However, if a user wants to experiment with a new model type, or change some small detail the developers hadn’t planned, they are back to deriving gradients by hand. Fortunately, a 30+ year old idea, called “automatic differentiation”, and a one year old machine learning-oriented implementation of it, called “autograd”, can bring true and lasting peace to the hearts of model builders. With autograd, building and training even extremely exotic neural networks becomes as easy as describing the architecture. We will also address two practical questions — "What's the difference between all these deep learning libraries?" and "What does this all mean to me, as a biologist?" — as well as providing some detail and historical perspective on the topic of automatic differentiation.
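To give a feel for what automatic differentiation does, here is a minimal sketch using dual numbers. Note that this is forward-mode differentiation, chosen for brevity; autograd and the deep learning frameworks named above use reverse mode, which is more efficient when there are many parameters. The class and function names are illustrative and do not belong to any library.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative zero).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def tanh(x):
    """tanh with the chain rule applied to the carried derivative."""
    t = math.tanh(x.value)
    return Dual(t, (1.0 - t * t) * x.deriv)

def derivative(f, x):
    """Evaluate df/dx at x by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).deriv

def neuron(w):
    # Hypothetical one-weight model: squared output of tanh(0.5 * w + 0.1).
    out = tanh(w * 0.5 + 0.1)
    return out * out

grad = derivative(neuron, 0.3)
```

The user writes only the forward computation in `neuron`; the derivative falls out of the overloaded arithmetic, with no calculus done by hand.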
Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics
Abstract: We present generalizable and interpretable supervised deep learning frameworks to predict the regulatory and epigenetic state of putative functional genomic elements by integrating raw DNA sequence with diverse chromatin assays such as ATAC-seq, DNase-seq or MNase-seq. First, we develop novel multi-channel, multi-modal CNNs that integrate DNA sequence and chromatin accessibility profiles (DNase-seq or ATAC-seq) to predict in vivo binding sites of a diverse set of transcription factors (TFs) across cell types with high accuracy. Our integrative models provide significant improvements over other state-of-the-art methods, including recently published deep learning TF binding models. Next, we train multi-task, multi-modal deep CNNs to simultaneously predict multiple histone modifications and combinatorial chromatin state at regulatory elements by integrating DNA sequence, RNA-seq and ATAC-seq or a combination of DNase-seq and MNase-seq. Our models achieve high prediction accuracy even across cell types, revealing a fundamental predictive relationship between chromatin architecture and histone modifications. Finally, we develop DeepLIFT (Deep Linear Importance Feature Tracker), a novel interpretation engine for extracting predictive and biologically meaningful patterns from deep neural networks (DNNs) for diverse genomic data types. DeepLIFT is the first method that can integrate the combined effects of multiple cooperating filters and compute importance scores accounting for redundant patterns. We apply DeepLIFT to our models to obtain unified TF sequence affinity models, infer high-resolution point binding events of TFs, dissect regulatory sequence grammars involving homodimeric and heterodimeric binding with co-factors, learn predictive chromatin architectural features and unravel the sequence and architectural heterogeneity of regulatory elements.
Abstract: Tumors contain genetically heterogeneous cancerous subpopulations that can differ in their metastatic potential and response to treatment. Our work over the past few years has focused on using computational and statistical methods to reconstruct the phylogeny and the full genotypes of these subpopulations using data from high-throughput sequencing of tumor samples.
Tumor subpopulations can be partially characterised by identifying tumor-associated somatic variants using short read sequencing. Subsequent inference of copy number variants or clustering of the variant allele frequencies (VAFs) can reveal the number of major subpopulations present in the tumor as well as the set of mutations which first appear in each subpopulation. Further analysis, and often different data, is needed to determine how the subpopulations relate to one another and whether they share any mutations. Ideally, this analysis would reconstruct the full genotypes of each subpopulation.
I will describe my lab’s efforts to recover these full genotypes by reconstructing the tumor’s evolutionary history. We do this by fitting subpopulation phylogenies to the VAFs. In some circumstances, a full reconstruction is possible, but often multiple phylogenies are consistent with the data. We have developed a number of methods (PhyloSub, PhyloWGS, treeCRP, PhyloSpan) that use Bayesian inference in non-parametric models to distinguish ambiguous and unambiguous portions of the phylogeny, thereby explicitly representing reconstruction uncertainty. Our methods consider both single nucleotide variants and copy number variations, and adapt to data on pairs of mutations.
Data Sciences & Data Engineering
Primer: Intro to Dirichlet Processes
Abstract: At a mundane level, Dirichlet processes are a clustering algorithm that determines the number of clusters. However, they are also a way to do Bayesian inference on a single infinite model rather than ad hoc model selection on a series of finite models and are the gateway to the field of Bayesian non-parametric models. Many introductions to Dirichlet processes take a formal measure-theoretic approach. In contrast, if you can understand the multinomial distribution you will understand this primer.
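One standard way to see how a Dirichlet process "determines the number of clusters" is the Chinese restaurant process, which describes the cluster assignments a Dirichlet process prior induces. The sketch below is an illustration under that framing, not necessarily the presentation used in the primer; the function name and the `alpha` and `seed` parameters are assumptions. Item i joins an existing cluster with probability proportional to its current size, or opens a new cluster with probability proportional to `alpha`.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process prior."""
    rng = random.Random(seed)
    assignments = []  # cluster label of each item, in arrival order
    counts = []       # current size of each cluster
    for i in range(n_items):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        # The weights sum to i + alpha because i items are already seated.
        r = rng.uniform(0, i + alpha)
        cumulative = 0.0
        choice = len(counts)  # default: open a new cluster
        for k, size in enumerate(counts):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        assignments.append(choice)
    return assignments, counts


assignments, counts = chinese_restaurant_process(n_items=100, alpha=1.0)
print(len(counts), "clusters with sizes", counts)
```

The number of clusters is not fixed in advance: it grows with the data, and larger values of `alpha` favor more clusters.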
Beck Lab, Harvard Medical School at Beth Israel Deaconess Medical Center
Deep learning for computational pathology
Abstract: In this talk, we will provide an introduction to computational pathology, an emerging discipline at the intersection of pathology and computer engineering. We will also introduce a deep learning-based automatic whole slide image analysis system for the identification of cancer metastases in breast sentinel lymph nodes. Our system placed first in the international Camelyon16 challenge, which was held at the International Symposium on Biomedical Imaging (ISBI) 2016. The system achieved an area under the receiver operating characteristic curve (AUC) of 0.925 for the task of whole slide image classification and an average sensitivity of 0.705 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. Combining the predictions from the human pathologist and the automatic analysis system improves performance further. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.
Beck Lab, Harvard Medical School at Beth Israel Deaconess Medical Center
Primer: Practical recommendations for training convolutional neural nets
Abstract: Deep learning, in particular the convolutional neural network (ConvNet), is rapidly emerging as one of the most successful approaches for image and speech recognition. What distinguishes ConvNets and other deep learning systems from conventional machine learning techniques is their ability to learn the entire perception process from end to end. Deep learning systems use multiple nonlinear processing layers to learn useful representations of features directly from data.
Searching the parameter space of deep architectures is a complex optimization task. ConvNets can be very sensitive to their hyper-parameter settings and network architecture. In this talk, I will give practical recommendations for training ConvNets and discuss the motivation and principles behind them. I will also provide recommendations on how to tackle various problems in analyzing medical image data, such as lack of data and highly skewed class distributions.
Finally, I will introduce some of the advanced ConvNet architectures used in medical image analysis and their suitability for various tasks such as detection, classification, and segmentation.
December 7, 2016
Spectral unmixing for next-generation mass spectrometry proteomics
Abstract: Mass spectrometry proteomics is the method of choice for large-scale quantitation of proteins in biological samples, allowing rapid measurement of the concentrations of thousands of proteins in various modified forms. However, this technique still faces fundamental challenges in terms of reproducibility, bias, and comprehensiveness of proteome coverage. Next-generation mass spectrometry, also known as data-independent acquisition, is a promising new approach with the potential to measure the proteome in a far more comprehensive and reproducible fashion than existing methods, but it has lacked a computational framework suited to the highly convoluted spectra it inherently produces. I will discuss Specter, an algorithm that employs linear unmixing to disambiguate the signals of individual proteins and peptides in next-generation mass spectra. In addition to describing the linear algebra underlying Specter, we'll discuss its implementation in Spark with Python, and see several real datasets to which it's been applied.
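The linear unmixing idea can be illustrated with a small sketch: an observed spectrum is modeled as a non-negative combination of known reference spectra, and the coefficients are recovered by least squares. This is not Specter's actual implementation (which the abstract says runs on Spark); the non-negativity constraint, the projected-gradient solver, the function name `unmix`, and the toy spectra are all assumptions made for illustration.

```python
def unmix(library, observed, n_iter=200):
    """Non-negative least squares by projected gradient descent.

    library: list of reference spectra, each a list of intensities per m/z bin.
    observed: the mixed spectrum, one intensity per m/z bin.
    Returns one non-negative coefficient per reference spectrum.
    """
    n_comp, n_bins = len(library), len(observed)
    # Step size 1/L, where L (the largest eigenvalue of the Gram matrix)
    # is bounded above by the squared Frobenius norm of the library.
    step = 1.0 / sum(v * v for spectrum in library for v in spectrum)
    coeffs = [0.0] * n_comp
    for _ in range(n_iter):
        fitted = [sum(coeffs[k] * library[k][j] for k in range(n_comp))
                  for j in range(n_bins)]
        residual = [fitted[j] - observed[j] for j in range(n_bins)]
        grad = [sum(library[k][j] * residual[j] for j in range(n_bins))
                for k in range(n_comp)]
        # Gradient step, then projection onto the non-negative orthant.
        coeffs = [max(0.0, coeffs[k] - step * grad[k]) for k in range(n_comp)]
    return coeffs


# Three overlapping toy reference spectra over five m/z bins.
library = [
    [1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0],
]
# Observed spectrum: 2.0 x first reference + 0.5 x third reference.
observed = [2.0, 1.0, 0.0, 0.25, 0.5]
coeffs = unmix(library, observed)
print([round(c, 4) for c in coeffs])
```

Even though the reference spectra overlap in several bins, the fit attributes the shared signal to the correct components because the full spectral shape of each reference is used.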
Primer: Mass spectrometry-based proteomics
Abstract: Mass spectrometry is the workhorse technology to study the abundance and composition of proteins, the key players in every living cell. Within the last decade the technology experienced a revolution in terms of novel instrumentation and optimized sample handling protocols resulting in ever growing numbers of proteins and post-translational modifications that can be routinely studied on a system-wide scale. Briefly, proteins are extracted from cells or tissues and fragmented into smaller peptides. This extremely complex peptide mixture is subjected to liquid chromatography separation and subsequent tandem mass spectrometry analysis in which mass-to-charge ratios of intact peptides and peptide fragments are recorded. Resulting mass spectra are matched to sequence databases or spectral libraries to read out the amino acid sequences and thereby identify the corresponding proteins.
The technology is fundamentally different from sequencing-based genomics technology and faces different problems, such as the tremendous dynamic range of protein expression. The instruments can be operated in different acquisition modes for different applications. I will briefly introduce the basics behind discovery or ‘shotgun’ proteomics, targeted proteomics, data dependent acquisition and data independent acquisition; the latter is a recent and promising development in the proteomics community but poses novel and only partly solved challenges in data analysis. Ryan Peckner will talk about Specter, an approach that tackles this problem using linear algebra.
December 14, 2016
Compiling probabilistic programs
Abstract: Deriving and implementing an inference algorithm for a probabilistic model can be a difficult and error-prone task. Alternatively, in probabilistic programming, a compiler is used to transform a model into an inference algorithm. In this talk, we'll present probabilistic programming from the perspective of a compiler writer. A compiler for a traditional language uses intermediate languages (ILs) and static analysis to generate efficient code. We'll highlight how these ideas can be used in probabilistic programming for generating flexible and scalable inference algorithms.
Hail Team, Neale Lab
Primer: What is a compiler?
Abstract: A compiler is an algorithm that transforms a source language into a target language. The transformation typically includes an optimizing pass which reduces memory or time requirements. Classic compilers transform languages such as C or Java into near-machine code such as x86 Assembly or JVM Bytecode. Recent work on Domain Specific Languages (DSLs) expands the notion of "source language" in order to enable everyone to build easy-to-reason-about abstractions without the performance penalty. In this context, I will discuss compiler design and implementation techniques with examples.
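As a small illustration of "source language, optimizing pass, target language", the sketch below compiles arithmetic expressions into instructions for a toy stack machine, with constant folding as the optimizing pass. The instruction names (`PUSH`, `LOAD`, `ADD`, `MUL`) and the functions `fold`, `emit`, `compile_expr`, and `run` are hypothetical; only Python's standard `ast` module is used for parsing, and only addition and multiplication are supported.

```python
import ast
import operator

# Supported operators: AST node type -> (instruction name, Python function).
OPS = {ast.Add: ("ADD", operator.add), ast.Mult: ("MUL", operator.mul)}


def fold(node):
    """Optimizing pass: evaluate constant subexpressions at compile time."""
    if isinstance(node, ast.BinOp):
        if type(node.op) not in OPS:
            raise ValueError("unsupported operator")
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, ast.Constant) and isinstance(right, ast.Constant):
            return ast.Constant(OPS[type(node.op)][1](left.value, right.value))
        return ast.BinOp(left, node.op, right)
    return node


def emit(node, code):
    """Code generation: post-order walk producing stack-machine instructions."""
    if isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.Name):
        code.append(("LOAD", node.id))
    elif isinstance(node, ast.BinOp) and type(node.op) in OPS:
        emit(node.left, code)
        emit(node.right, code)
        code.append((OPS[type(node.op)][0],))
    else:
        raise ValueError("unsupported syntax")
    return code


def compile_expr(source, optimize=True):
    tree = ast.parse(source, mode="eval").body
    if optimize:
        tree = fold(tree)
    return emit(tree, [])


def run(code, env):
    """A tiny interpreter for the target language, used to check the output."""
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        elif instr[0] == "LOAD":
            stack.append(env[instr[1]])
        else:
            right, left = stack.pop(), stack.pop()
            fn = operator.add if instr[0] == "ADD" else operator.mul
            stack.append(fn(left, right))
    return stack.pop()


source = "x * (2 + 3) + 4 * 5"
plain = compile_expr(source, optimize=False)
optimized = compile_expr(source)
print(len(plain), "instructions before folding,", len(optimized), "after")
print(run(optimized, {"x": 7}))
```

The optimized and unoptimized programs compute the same value; the optimizing pass only moves work from run time to compile time.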
On Nov 12, the Broad welcomed a visit from Ryan Adams, a leader in machine learning - a field at the intersection of applied math and computer science that develops models and algorithms to learn from data...
Abstract: Deep learning will transform biology and medicine, but not in the way that many advocates think. Downloading ten thousand genomes and training a neural network to predict disease won't cut it. It is overly simplistic to believe that deep learning, or machine learning in general, can successfully be applied to genome data without taking into account biological processes that connect genotype to phenotype. The amount of data multiplied by the mutation frequency divided by the biological complexity and the number of hidden variables is too small. I’ll describe a rational “software meets bio” approach that has recently emerged in the research community and that is being pursued by dozens of young investigators. The approach has improved our ability to “read the genome”, and I believe it will have a significant impact on genome biology and medicine. I'll discuss which applications are ripe and which are merely seductive, how we should train models to take advantage of new types of data, and how we can interpret machine learning models.
Judging the importance of human mutations using evolutionary models
Abstract: Many forces influence the fate of alleles in populations, and a detailed quantitative description of allelic dynamics is complex. However, some applications allow for simplifications that make evolutionary models useful in the context of human genetics. Examples include comparative genomics and the analysis of large-scale sequencing datasets.
Systems biology: can mathematics lead experiments?
Abstract: The -omic revolution in biology, and parallel developments in microscopy and imaging, have opened up fascinating new opportunities for analysing biological data using tools from the mathematical sciences. However, the kind of data we have and the way we interpret them are determined by the conceptual landscape through which experimentalists reason about biology. In this talk, I will consider how mathematics can help to shape that conceptual landscape and thereby suggest new experimental strategies. I will describe some of our recent work on how eukaryotic genes are regulated, which tries to update conventional thinking in this field, which is largely derived from bacterial studies, and I will point out how this exercise gives rise to mathematical conjectures for which we currently have no solutions.
Abstract: Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematical problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments. Second, I will analyze the hypothesis that differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.
Abstract: What can we learn by observing nature? How can we understand and predict natural phenomena? This talk is on the mathematics of precision measurement. How can we solve for the input that generated the output of some measurement apparatus? Our starting point is an information theoretic prior of sparsity. We investigate sparse inverse problems where we assume the input can be described by a small number of parameters. We introduce some of our recent theoretical results in superresolution and in spectral clustering. In particular, we show how to solve infinite dimensional deconvolution problems with finite dimensional convex optimization. And we show why dimensionality reduction can be such a useful preprocessing step for mixture models.
Abstract: The DNA of the human genome is 2 m long and is folded into a structure that fits in a cell nucleus. One of the central physical questions here is the question of scales: How can microscopic processes of molecular interaction at the nanometer scale drive chromosomal organization at the scale of microns? Inferring principles of 3D organization of chromosomes from a range of biological data is a challenging biophysical problem. We develop a top-down approach to biophysical modeling of chromosomes. Starting with a minimal set of biologically motivated interactions, we build polymer models of chromosome organization that can reproduce major features observed in Hi-C and microscopy experiments. I will present our work on modeling the organization of human metaphase and interphase chromosomes.
Haplotype phasing in large cohorts: Modeling, search, or both?
Abstract: Inferring haploid phase from diploid genotype data -- "phasing" for short -- is a fundamental question in human genetics and a key step in genotype imputation. How should one go about phasing a large cohort? The answer depends on how large. In this talk, I will contrast two approaches to computational phasing: hidden Markov models (HMMs), which perform precise but computationally expensive statistical inference, and long-range phasing (LRP), which relies instead on rapidly searching for long genomic segments shared among samples. I will present a new LRP method (Eagle), describe its performance on N=150,000 UK Biobank samples, and discuss future directions.
Abstract: Modern computing and the web are both enabling and changing how we do science. Using neuroscience as an example, I will highlight some of these developments, spanning a surprising diversity of technologies. I'll discuss distributed computing for data analytics, cloud computing and containerization for reproducibility, peer-to-peer networks for sharing data and knowledge, functional reactive programming for hardware control, and WebGL for large-scale interactive experiments. I will also describe several open source projects we and others are working on across these domains. I hope to convey both what we're learning about the brain with these approaches, and how science itself is evolving in the process.
A quick introduction to TensorFlow and related APIs
Abstract: TensorFlow was recently released to the open source world as a platform for developing cutting-edge ML models, with an emphasis on deep architectures including neural nets, convolutional neural nets, recurrent neural nets, and LSTMs. The open source version of TensorFlow now supports distributed computation across many machines, opening up a new level of scale to the research community. In this talk, we'll go over a quick introduction to the basic TensorFlow abstractions, and will also look at some higher-level APIs that offer a convenient level of abstraction for many common use cases. Folks interested in learning more are encouraged to visit tensorflow.org, and the excellent Udacity course on ML featuring TensorFlow.
Harvard Organismic and Evolutionary Biology (Chair)
The effects of population pedigrees on gene genealogies
Abstract: The models of coalescent theory for diploid organisms are wrongly based on averaging over reproductive, or family, relationships. In fact, the entire set of relationships, which may be called the population pedigree, is fixed by past events. Because of this, the standard equations of population genetics for probabilities of common ancestry are incorrect. However, the predictions of coalescent models appear surprisingly accurate for many purposes. A number of different scenarios will be investigated using simulations to illustrate the effects of pedigrees on gene genealogies both within and among loci. These scenarios include selective sweeps, the occurrence of very large families, and population subdivision with migration.
AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Abstract: Deep convolutional neural networks (neural nets with a constrained architecture that leverages the spatial and temporal structure of the domain they model) achieve the best predictive performance in areas such as speech and image recognition. Such neural networks autonomously discover and hierarchically compose simple local features into complex models. We demonstrate that biochemical interactions, being similarly local, are amenable to automatic discovery and modeling by similarly-constrained machine learning architectures. We describe the training of AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications, on millions of training examples derived from ChEMBL and the PDB. We visualize the automatically-derived convolutional filters and demonstrate that the system is discovering chemically sensible interactions. Finally, we demonstrate the utility of autonomously-discovered filters by outperforming previous docking approaches and achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. In further contrast to existing DNN techniques, we show that AtomNet’s application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators.
Information in Cell Images: Targeting Diseases and Characterizing Compounds
Abstract: Our lab, the Broad’s Imaging Platform, aims to make perturbations in cell morphology as computable as other large-scale functional genomics data. We began by creating model-based segmentation algorithms to identify regions of interest in images (usually, individual cells or compartments within them) and produced software that has become the world standard for image analysis from high-throughput microscopy experiments (CellProfiler, cited in 3000+ scientific papers). We have taken on a new challenge – using cell images to identify signatures of genes and chemicals, with the ultimate goal of finding the cause and potential cures of diseases. High-throughput microscopy enables imaging several thousand cells per chemical or genetic perturbation, and identifying multiple organelles using fluorescent markers yields hundreds of image features per cell. We use this rich information to construct perturbation signatures or “profiles”. Our goals in these profiling experiments include identifying drug targets and mechanisms of action, determining the functional impact of disease-related alleles, creating performance-diverse chemical libraries, categorizing mechanisms of drug toxicity, and uncovering diagnostic markers for psychiatric disease. The technical challenges we encounter include dealing with cellular subpopulation heterogeneity, interpreting and visualizing statistical models, learning better representations of the data, and integrating imaging information with other data modalities.
DNA microscopy and the sequence-to-image inverse problem
Abstract: Technologies that jointly resolve both gene sequences and the spatial relationships of the cells that express them are playing an increasing role in deepening our understanding of tissue biology. In this talk, I will describe an experimental technique, called DNA microscopy, which encodes the physical structure and genetic composition of a biological sample directly into a library of DNA sequences. I will then discuss and demonstrate the application of N-body optimization to the inverse problem of inferring positions from real data.
Abstract: Molecular biology increasingly relies on large screens where enormous numbers of specimens are systematically assayed in the search for a particular, rare outcome. These screens include the systematic testing of small molecules for potential drugs and testing the association between genetic variation and a phenotype of interest. While these screens are "hypothesis-free," they can be wasteful; pooling the specimens and then testing the pools is more efficient. We articulate in precise mathematical terms the types of structure useful in combinatorial pooling designs so as to eliminate waste and to provide lightweight, flexible, and modular designs. We show that Reed-Solomon codes, and more generally linear codes, satisfy all of these mathematical properties. We further demonstrate the power of this technique with Reed-Solomon-based biological experiments. We provide general purpose tools for experimentalists to construct and carry out practical pooling designs with rigorous guarantees for large screens.
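A small sketch can show why Reed-Solomon codes suit pooling. The construction below follows the classic Kautz-Singleton style of group-testing design and is not necessarily the design used by the speakers; the parameters `Q`, `K`, `N_POS` and all function names are assumptions. Each specimen is identified with a polynomial of degree less than `K` over the integers mod a prime `Q`, and is placed in pool (x, y) when its polynomial takes value y at point x. Two distinct polynomials agree on at most `K - 1` points, which limits how much any two specimens overlap across pools.

```python
from itertools import product

Q = 5      # prime field size, so arithmetic mod Q forms a field
K = 2      # polynomials of degree < K, giving Q**K = 25 specimens
N_POS = 3  # evaluation points used, giving N_POS * Q = 15 pools


def codeword(coeffs):
    """Evaluate the polynomial with these coefficients at points 0..N_POS-1."""
    return [sum(c * pow(x, i, Q) for i, c in enumerate(coeffs)) % Q
            for x in range(N_POS)]


# Each specimen is labeled by its polynomial's coefficients.
specimens = list(product(range(Q), repeat=K))
# Specimen s goes into pool (x, y) when its codeword has symbol y at position x.
pools_of = {s: {(x, y) for x, y in enumerate(codeword(s))} for s in specimens}


def run_screen(true_positives):
    """A pool tests positive if it contains at least one positive specimen."""
    positive_pools = set()
    for s in true_positives:
        positive_pools |= pools_of[s]
    return positive_pools


def decode(positive_pools):
    """Call a specimen positive only if every pool containing it is positive."""
    return {s for s in specimens if pools_of[s] <= positive_pools}


hits = {(1, 2), (4, 0)}
print(decode(run_screen(hits)) == hits)
```

With these toy parameters, 25 specimens are screened with 15 pooled tests. Each specimen sits in 3 pools and shares at most 1 pool with any other specimen, so any set of up to 2 positives is recovered exactly: every negative specimen appears in at least one pool that contains no positive.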
The Science of Information: Case Studies in DNA and RNA Assembly
Abstract: Claude Shannon invented information theory in 1948 to study the fundamental limits of communication. The theory not only establishes the baseline against which to judge all communication schemes but also inspires the design of schemes that are simultaneously information-optimal and computationally efficient. In this talk, we discuss how this point of view can be applied to the problems of de novo DNA and RNA assembly from shotgun sequencing data. We establish information limits for these problems, and show how efficient assembly algorithms can be designed to attain these information limits, despite the fact that combinatorial optimization formulations of these problems are NP-hard. We discuss Shannon, a de novo RNA-seq assembly software designed based on such principles, and compare its performance against state-of-the-art assemblers on several datasets.
Abstract: Sparse regression has become an indispensable method for data analysis in the last 20 years. The general framework for sparse regression has a number of drawbacks that we and others address in recent methods, including robustness of model selection, issues with correlated predictors, and a test statistic that is based on the size of the effect. All of these issues arise in the context of association mapping of genetic variants to quantitative traits. This talk will discuss one approach to structured sparse regression to mitigate these problems in the context of genome-wide association mapping with quantitative traits using a Gaussian process prior to add structure to the sparsity-inducing prior across predictors. We will also describe ongoing efforts for variants on this model for different analytic purposes, including neuroscience applications, identifying driver somatic mutations in cancer, and methods for causal inference in observational data with large numbers of instruments.
Abstract: Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields.
The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory tasks and predictive tasks. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis.
Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand.
In this talk I will discuss our recent research on addressing these two limitations. I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets. I will demonstrate its application in genetics to the STRUCTURE model of Pritchard et al., 2000. Then I will discuss black box variational inference. Black box inference is a generic algorithm for approximating the posterior. We can easily apply it to many models with little model-specific derivation and few restrictions on their properties. I will demonstrate how we can use black box inference to develop new software tools for probabilistic modeling.
University of Washington, CS, EE and Genome Sciences
Identifying molecular markers for cancer treatment from big data
Abstract: The repertoire of drugs for patients with cancer is rapidly expanding; however, cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to specific drugs are in high demand. For example, patients over 65 with acute myeloid leukemia (AML), an aggressive blood cancer, have no better prognosis today than they did in 1980. For a growing number of diseases, there is a fair amount of data on molecular profiles from patients. The most important step necessary to realize the ultimate goal is to identify molecular markers in these data that predict treatment outcomes, such as response to each chemotherapy drug. However, due to the high dimensionality (i.e., the number of variables is much greater than the number of samples) along with potential biological or experimental confounders, it is an open challenge to identify robust biomarkers that are replicated across different studies. In this talk, I will present two novel machine learning algorithms to resolve these challenges. These methods learn, in an unsupervised fashion, the low-dimensional features that are likely to represent important molecular events in the disease process, based on molecular profiles from multiple populations of cancer patients. These algorithms led to the identification of novel molecular markers in AML and ovarian cancer.