Models, Inference & Algorithms (MIA) is a Broad initiative to support learning and collaboration across the interface of biology and mathematics / statistics / machine learning / computer science.
Our core activity is the Wednesday morning meeting in the Monadnock room (415 Main St, 2nd floor), featuring a method primer at 9, a main seminar with breakfast at 10, and a discussion with the speaker at 11. These meetings grew out of the Stat Math Reading Club (SMRC), a series of informal and pedagogical board talks; over a year and a half the talks attracted an ever-larger audience from Broad and the wider Boston community. With MIA we strive to maintain SMRC's essential character, emphasizing lucid exposition of broadly applicable ideas over rapid-fire communication of research results and encouraging questions from the audience throughout. In addition to the weekly meeting there are a number of other MIA activities. Please contact email@example.com to be added to our mailing list and learn more.
The MIA Initiative is led by Jon Bloom and Alex Bloemendal, and affiliated with the Data Sciences Platform, the Office of the Chair of the Faculty, and the MIT math department. The MIA community has been enriched by the awesome efforts of Hilary Finucane, David Benjamin, Yakir Reshef, Umut Eser, David Rolnick, Ryan Peckner, Ann Sizemore, and many others. SMRC thrived in collaboration with Bertrand Haas, Anthony Philippakis, Yossi Farjoun, and Cotton Seed.
Spring 2018 Schedule: 9:00am Primer, 9:50am Breakfast, 10:00am Seminar, 11:00am Discussion; all in Monadnock
|Feb 7||No primer|
|Tami Lieberman (in Auditorium today)||MIT IMES, CEE||Rapid bacterial adaptation within individual human microbiomes|
|Feb 14||Fredrik Johansson||MIT IMES||Primer on causal inference|
|David Sontag||MIT EECS, IMES, CSAIL||AI for health needs causality|
|Feb 21||No primer|
|Po-Ru Loh and Giulio Genovese||HMS and Broad||Leveraging long range phasing to detect mosaicism in blood at ultra-low allelic fractions|
|Feb 28||Ray Jones||Broad||TBD|
|Omer Weissbrod||Harvard HSPH||TBD (Microbiome explainability)|
|Jacob Oppenheim||Indigo Agriculture||TBD (Latent representations of microbial diversity)|
Institute for Medical Sciences and Engineering and Department of Civil and Environmental Engineering, MIT
Rapid bacterial adaptation within individual human microbiomes
Abstract: We tend to think about human microbiomes as communities of static species. The degree to which individual commensal species functionally change within individual people has remained elusive, as it is difficult to identify de novo mutations from metagenomic data of mixed communities. We have recently discovered that commensal members of our microbiota acquire de novo mutations with strong fitness consequences within individual people. In this talk, I will discuss the challenges with identifying within-person evolution from metagenomics alone, describe how culture-dependent methods enable powerful evolutionary inferences, and touch on the implications of an evolving microbiome for data interpretation and therapy.
MIT EECS, IMES, CSAIL
AI for health needs causality
Abstract: Recent success stories of using machine learning for diagnosing skin cancer, diabetic retinopathy, pneumonia, and breast cancer may give the impression that artificial intelligence (AI) is on the cusp of radically changing all aspects of health care. However, many of the most important problems, such as predicting disease progression, personalizing treatment to the individual, drug discovery, and finding optimal treatment policies, all require a fundamentally different way of thinking. Specifically, these problems require a focus on *causality* rather than simply prediction. Motivated by these challenges, my lab has been developing several new approaches for causal inference from observational data. In this talk, I describe our recent work on the deep Markov model (Krishnan, Shalit, Sontag AAAI '17) and TARNet (Shalit, Johansson, Sontag, ICML '17).
Primer on causal inference
Abstract: A common goal in science is to use knowledge gained by observing a phenomenon of interest to guide decision and policy making. If smokers are observed to have higher rates of lung cancer, should we legislate to discourage smoking? Such a policy will only be effective if smoking itself is the cause of cancer and the correlation between cancer rate and smoking is not explained by other factors, such as lifestyle choices. Problems like these are well described in the language of causal inference. In this primer, we explain the difference between statistical and causal reasoning, and introduce the notions of confounding, causal graphs and counterfactuals. We cover the problem of estimating causal effects from experimental and observational data, as well as sufficient assumptions to make causal statements based on statistical quantities.
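The confounding described above can be illustrated with a small simulation. This is a minimal sketch, not material from the primer: the binary confounder `z`, the treatment probabilities, and the effect sizes are all assumptions chosen for the demo. It compares a naive difference in means with an estimate that stratifies on the confounder and averages over its distribution.

```python
import random


def simulate(n, true_effect=1.0, seed=0):
    """Generate (z, t, y) rows where z confounds treatment t and outcome y.

    All numbers here are hypothetical, chosen only to make the bias visible.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                    # confounder
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0    # z drives treatment
        y = true_effect * t + 2.0 * z + rng.gauss(0.0, 1.0)   # z also drives outcome
        rows.append((z, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated units."""
    treated = mean(y for _, t, y in rows if t == 1)
    control = mean(y for _, t, y in rows if t == 0)
    return treated - control


def adjusted_difference(rows):
    """Stratify on z, then average the per-stratum differences weighted by P(z)."""
    total = 0.0
    for z_value in (0, 1):
        stratum = [row for row in rows if row[0] == z_value]
        total += naive_difference(stratum) * len(stratum) / len(rows)
    return total


rows = simulate(20000)
naive = naive_difference(rows)
adjusted = adjusted_difference(rows)
print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f} (simulated true effect is 1.0)")
```

The adjustment recovers the simulated effect only because the single confounder is observed and there is overlap in every stratum, which is an instance of the kind of sufficient assumptions the abstract mentions.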
Po-Ru Loh, Giulio Genovese
BWH, HMS, Broad
Leveraging long range phasing to detect mosaicism in blood at ultra-low allelic fractions
Abstract: What would you do with 500,000 near-perfectly phased genomes? One answer (we welcome others!): harness this information to detect subtle imbalances between maternal and paternal allelic fractions in blood DNA -- the hallmark of clonal mosaic chromosomal alterations [1,2]. In this talk, we will describe how we phased the UK Biobank to chromosome-scale accuracy [3,4], developed HMM-based machinery to sensitively call mosaic alterations, and probed the data to reveal new insights into the causes and consequences of clonal hematopoiesis.
Fall 2017 Schedule: 9:00am Primer, 9:50am Breakfast, 10:00am Seminar, 11:00am Discussion; all in Monadnock
Reading the rules of gene regulation from the human noncoding genome
Abstract: Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, thousands of noncoding loci associated with diseases and physical traits lack mechanistic explanation. I'll present a machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. I'll show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable GWAS loci fine mapping.
Broad Data Sciences Platform
Primer: Classifying genomic sequences with convolutional neural networks
Abstract: Initially developed for image processing, Convolutional Neural Networks (CNNs) have been applied to genomic data with promising results. This primer will trace some of the history of neural networks with an eye towards the practical lessons learnt along the way. Then building on the idea of the Position Weight Matrix as a motif detector we will explore exactly what convolution means when applied to a DNA sequence. While drawing examples from computer vision and natural language processing, our focus will be on the application of CNNs to genomic data. Lastly, we will cover recent advances in CNNs including residual connections and dilated convolutions.
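To make the link between a position weight matrix and a convolution concrete, here is a minimal sketch in plain Python. The scoring matrix is a hypothetical toy filter for the motif TATA (+1 for the matching base, -1 otherwise), not a real log-odds PWM, and the function names are illustrative. Sliding the matrix along a one-hot encoded sequence and summing elementwise products at each offset is the operation a single 1D convolutional filter performs.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of length-4 indicator rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def scan(seq, pwm):
    """Slide the matrix along the sequence; one score per valid offset."""
    x = one_hot(seq)
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        score = 0.0
        for k in range(width):
            for c in range(4):
                score += pwm[k][c] * x[start + k][c]
        scores.append(score)
    return scores


# Hypothetical toy filter for the motif "TATA".
pwm = [[1.0 if b == m else -1.0 for b in BASES] for m in "TATA"]

scores = scan("GGTATACC", pwm)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)   # one score per window of width 4
print(best)     # offset of the highest-scoring window
```

In a CNN the entries of the filter are learned from data rather than fixed in advance, and many such filters are applied in parallel.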
Harvard CS, Harvard/MIT MD/PhD program
Detecting effects of transcription factors on disease
Abstract: Learning biology using GWAS data frequently involves identifying genomic regions involved in a biological process and assessing for enrichment of GWAS signal in those regions. But in some cases, e.g., binding of a transcription factor (TF), improving models and growing data sets allow us to estimate in a signed way whether genetic variants promote or hinder a biological process. I'll present a new method, signed LD profile regression, for combining this type of information with GWAS data to draw relatively strong inferences about trait mechanism. I'll then describe how this method can be applied in conjunction with signed genomic annotations reflecting binding of ~100 TFs in various cell lines generated using a convolutional neural network, Basset. Finally, I'll discuss some results from applying our method to GWAS data about a range of traits including gene expression, epigenetic traits, and several diseases.
Finucane Lab, Broad Institute
Primer: Generalized least squares
Abstract: Linear models are a very common choice when modeling the relation between inputs and outputs because of their simplicity and interpretability. We will explore methods for parameter estimation in these models, with an eye toward understanding some of the more advanced techniques. We will start by reviewing the most commonly used estimator: the ordinary least squares (OLS) estimator. Then we will explore some limitations of the OLS estimator when the residuals are not i.i.d. and discuss how to overcome these limitations, first with weighted least squares and then with generalized least squares. We'll close by discussing linear models in the context of genome-wide association studies (GWAS) as a lead-in to the talk.
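For reference, the three estimators named above can be written side by side. The notation is an assumption made here (design matrix X, response y, residual covariance proportional to a known matrix Omega) and may differ from the primer's.

```latex
% Model: residuals have mean zero and covariance proportional to a known \Omega
y = X\beta + \varepsilon, \qquad
\mathbb{E}[\varepsilon] = 0, \qquad
\operatorname{Var}(\varepsilon) = \sigma^2 \Omega

\begin{align*}
\hat\beta_{\mathrm{OLS}} &= (X^\top X)^{-1} X^\top y
  && \text{i.i.d. residuals, } \Omega = I \\
\hat\beta_{\mathrm{WLS}} &= (X^\top W X)^{-1} X^\top W y
  && \text{uncorrelated, unequal variances, } \Omega = W^{-1} \text{ diagonal} \\
\hat\beta_{\mathrm{GLS}} &= (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y
  && \text{general } \Omega
\end{align*}
```

GLS is OLS applied after whitening: if Omega = L L^T, regressing L^{-1} y on L^{-1} X by ordinary least squares gives the same estimate, and WLS is the special case where Omega is diagonal.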
Broad / HMS
Learning phylogeny through f-statistics
Abstract: f-statistics are now a commonplace tool in population genetics, used to discover and test models for demographic history. We describe the theory and ADMIXTOOLS, a package that implements most of the tests used. We give a number of examples of discoveries about human history made using f-statistics and conclude with some things we would like to do better and some open questions. [paper]
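As a rough illustration of what f-statistics measure, the sketch below computes f2, f3, and f4 directly from population allele frequencies, following their standard definitions as averages over SNPs. It is a simplified sketch and not how ADMIXTOOLS is implemented: it omits the finite-sample bias corrections and block-jackknife standard errors that real analyses need, and the frequencies are made-up toy values.

```python
def f2(a, b):
    """Average squared allele-frequency difference between populations A and B."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def f3(c, a, b):
    """f3(C; A, B): average of (c - a)(c - b). Negative values suggest C is admixed."""
    return sum((z - x) * (z - y) for z, x, y in zip(c, a, b)) / len(c)


def f4(a, b, c, d):
    """f4(A, B; C, D): average of (a - b)(c - d), a correlation of frequency differences."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)


# Toy allele frequencies at four SNPs (hypothetical values).
pop_a = [0.1, 0.5, 0.9, 0.3]
pop_b = [0.3, 0.2, 0.6, 0.7]
# A population formed as an equal mixture of A and B, with no subsequent drift.
pop_c = [0.5 * (x + y) for x, y in zip(pop_a, pop_b)]

print(f2(pop_a, pop_b))
print(f3(pop_c, pop_a, pop_b))
```

For this idealized 50/50 mixture, f3(C; A, B) equals -0.25 times f2(A, B), which is why a significantly negative f3 is used as evidence of admixture.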
Broad Institute, MIT Statistics
Learning developmental landscapes from single-cell gene expression with optimal transport
Abstract: Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time points over 16 days during reprogramming of fibroblasts to iPSCs. We construct a high-resolution map of reprogramming that rediscovers known features; uncovers new alternative cell fates including neural- and placental-like cells; predicts the origin and fate of any cell class; highlights senescent-like cells that may support reprogramming through paracrine signaling; and implicates regulatory models in particular trajectories. Of these findings, we highlight Obox6, which we experimentally show enhances reprogramming efficiency. Our approach provides a general framework for investigating cellular differentiation. [paper]
ENS Paris Mathematics
Primer: A tutorial on optimal transport
Abstract: The optimal transport (OT) problem is often described as that of finding the most efficient way of moving a pile of dirt from one configuration to another. Once stated formally, OT provides extremely useful tools for comparing, interpolating and processing objects such as distributions of mass, probability measures, histograms or densities. This talk is an up-to-date tutorial on a selection of topics in OT. In the first part, I will give an intuitive description of OT, its behavior and basic properties. I will also explain a useful extension of the theory to deal with unnormalized distributions of mass. In the second part, I will introduce state-of-the-art numerical methods for solving OT related problems, namely scaling algorithms based on entropic regularization.
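The scaling algorithms mentioned in the second part can be sketched in a few lines. Below is a minimal version of Sinkhorn iterations for entropically regularized OT between two discrete distributions of equal total mass; it does not cover the unnormalized extension. The regularization strength `eps`, the iteration count, and the toy histograms are assumptions chosen for the demo, and a practical implementation would work in the log domain for small `eps`.

```python
import math


def sinkhorn(a, b, cost, eps=1.0, n_iters=500):
    """Return an approximate transport plan between histograms a and b.

    Alternately rescales rows and columns of the Gibbs kernel exp(-cost/eps)
    so that the plan's marginals match a and b.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: mass on three points of a line, squared-distance cost.
points = [0.0, 1.0, 2.0]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[(x - y) ** 2 for y in points] for x in points]

plan = sinkhorn(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
print(row_sums)  # should be close to a
print(col_sums)  # should be close to b
```

Larger `eps` gives faster convergence but a more diffuse plan; as `eps` shrinks the plan approaches the unregularized optimum at the cost of more iterations and numerical care.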
Insight into the biology of common diseases using summary statistics of large genome-wide association studies
Abstract: Data from genome-wide association studies (GWAS) contain valuable information about the genetic basis of the disease. For most common diseases, obtaining insights from these data is difficult because the signal is very diffuse: there are likely thousands or tens of thousands of causal variants, each with a very small effect size on disease risk. Moreover, for many of the largest disease GWAS, no individual researcher has access to all of the genotype data; rather, the only data available are meta-analyzed marginal effect size estimates for each variant. I will describe a powerful approach to modeling these summary statistics that allows us, for example, to identify disease-relevant tissues and cell types, or to quantify the degree to which two traits have a common genetic basis. The approach, called LD score regression, is based on a commonly used model in genetics in which the effect size of each variant on the disease is random. The parameters of this model provide information about the disease such as whether regions of the genome active in a given tissue (e.g., liver) tend to be more associated with disease than regions of the genome active in a second tissue (e.g., brain). I will present results from an application of LD score regression to identify relevant tissues and cell types from several large GWAS, and from an application of LD score regression to identify pairs of phenotypes with shared genetic basis. [papers 1, 2, 3, 4]
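For readers who want the regression behind the name, the relationship as it appears in the published LD score regression papers is summarized below. The symbols are the usual ones from that literature and are not defined in the abstract: N is the GWAS sample size, M the number of variants, h^2 the SNP heritability, a a term capturing confounding such as population stratification, and r_jk the correlation between variants j and k.

```latex
% LD score of variant j
\ell_j = \sum_{k} r_{jk}^2

% Expected association statistic under the random-effects model
\mathbb{E}\left[\chi^2_j \mid \ell_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1

% Stratified version used for tissue and cell-type enrichment:
% \tau_C is the per-variant heritability contribution of annotation category C
\mathbb{E}\left[\chi^2_j\right] = N \sum_{C} \tau_C\,\ell(j, C) + N a + 1,
\qquad \ell(j, C) = \sum_{k \in C} r_{jk}^2
```

Regressing the observed chi-square statistics on LD scores therefore gives a slope that estimates heritability and an intercept that absorbs confounding, using only summary statistics and an LD reference panel.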
Sisi Sarkizova, Michael Rooney
Hacohen Lab, Broad / MGH; Neon Therapeutics
Improving the rules of endogenous antigen prediction to support personalized cancer vaccine development
Abstract: In the seminar, we will see how tumor-specific mutations (neo-antigens) can stimulate the immune recognition of cancer cells and be used as a therapeutic strategy. For such a strategy to be successful, we need to be able to predict which endogenous peptide antigens will be presented on the cell surface by polymorphic HLA class I gene variants. We will present analyses of our single HLA peptide data which allowed us to develop improved rules for endogenous peptide presentation based on the physicochemical properties of binding peptides, patterns of peptide cleavage and abundance of cognate transcripts. Incorporating these findings into neural network models improved prediction of endogenous peptide binding as compared to current predictive algorithms. We will end by reviewing very encouraging results from a tumor vaccine trial in melanoma patients.
Primer: Tumor immunity
Abstract: In the primer, we will introduce the basics of how the adaptive immune system recognizes diseased cells and see that immune responses rely on the ability of cytotoxic T cells to identify and eliminate cells that display disease-associated antigens bound to specific cell-surface receptors (the human leukocyte antigen (HLA) class I molecules). We will discuss how this mechanism extends to cancer, what are some strategies by which tumors evade immune detection, and what are the therapeutic interventions that can boost immune clearance of tumors.
Detecting novel associations in large data sets
Abstract: As data sets grow in dimensionality, making sense of the wealth of interactions they contain has become a daunting task, not just due to the sheer number of relationships but also because relationships come in different forms (e.g. linear, exponential, periodic, etc.) and strengths. If you do not already know what kinds of relationships might be interesting, how do you find the most important or unanticipated ones effectively and efficiently? This is commonly done by using a statistic to rank relationships in a data set and then manually examining the top of the resulting list. For such a strategy to succeed though, the statistic must give similar scores to equally noisy relationships of different types. In this talk we will formalize this property, called equitability, and show how it is related to a variety of traditional statistical concepts. We will then introduce the maximal information coefficient, a statistic that has state-of-the-art equitability in a wide range of settings, and discuss how its equitability translates to practical benefits in the search for dependence structure in high-dimensional data using examples from global health and the human gut microbiome.
Primer: Hypothesis testing and measures of dependence
Abstract: Searching for departures from statistical independence in data is a fundamental problem that has been formalized in a variety of ways. We will cover two frameworks in which this problem has historically been understood. The first is statistical and involves framing the search as a hypothesis test in a finite-sample setting. The second is probabilistic and involves defining functions of random variables that have useful properties in the large-sample limit. We will close with a discussion of common themes underlying measures of dependence arising from each of these paradigms.
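The first, hypothesis-testing framework can be illustrated with a permutation test, one common finite-sample approach. This is a minimal sketch under stated assumptions: the test statistic (absolute Pearson correlation), the number of permutations, and the simulated data are choices made here for illustration, not content from the primer.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """P-value for the null hypothesis that x and y are independent.

    Shuffling y breaks any dependence on x while preserving both marginals,
    so the shuffled statistics approximate the null distribution.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, keeping the p-value valid.
    return (hits + 1) / (n_perm + 1)


# Simulated data with a strong linear dependence (hypothetical values).
data_rng = random.Random(1)
x = [float(i) for i in range(30)]
y_dep = [2.0 * xi + data_rng.gauss(0.0, 3.0) for xi in x]

p_dep = permutation_pvalue(x, y_dep)
print(p_dep)
```

Swapping in a different statistic changes which kinds of dependence the test has power to detect; Pearson correlation, for example, is sensitive mainly to linear relationships.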
Message passing algorithms for cryo-EM and synchronization
Abstract: Cryo-electron microscopy is a promising imaging technique in structural biology, yielding a large number of very noisy images of a macromolecule in different, unknown rotations. The computational task of reconciling these images into a 3D model of the molecule has proven both mathematically rich and challenging, leading to a mathematical formulation of "synchronization" problems: the learning task of aligning rotated objects based on noisy measurements of their pairwise relative rotations. We present an algorithm following the framework of approximate message passing, which statistical physics suggests may yield the optimal efficient reconstruction. Our approach leverages the representation theory of compact groups to give a unified, general theory for problems with various conceptual 'rotations' or 'alignments'. (Joint work with Amelia Perry, Afonso Bandeira, and Ankur Moitra.)
Automated Machine Learning
Abstract: In order to achieve state-of-the-art performance, modern machine learning techniques require careful data pre-processing and hyperparameter tuning. Moreover, given the ever increasing number of machine learning models being developed, model selection is becoming increasingly important. Automating the selection and tuning of machine learning pipelines consisting of data pre-processing methods and machine learning models has long been one of the goals of the machine learning community. In this talk, we tackle this meta-learning task by combining ideas from collaborative filtering and Bayesian optimization. Using probabilistic matrix factorization techniques and acquisition functions from Bayesian optimization, we exploit experiments performed in hundreds of different datasets to guide the exploration of the space of possible pipelines. In our experiments, we show that our approach quickly identifies high-performing pipelines across a wide range of datasets, significantly outperforming the current state-of-the-art. We also show that this approach can be more generally used to tune parameters of any system (not just machine learning ones) by exploiting information gathered from multiple related experiments.
Machine-Learning-Based CRISPR Guide Design
Abstract: Effective guide design is a key part of CRISPR-Cas9 deployment. Although molecular biology is working to improve CRISPR-Cas9 and related systems, one can make the guide design process more effective by using machine learning. We will discuss our state-of-the-art machine-learning based guide design models for both on-target (Azimuth) and off-target (Elevation) prediction.
Erez Lieberman Aiden
Center for Genome Architecture, Baylor Med, Rice CS / Applied Math
A 3D Code in the Human Genome
Abstract: Stretched out from end-to-end, the human genome – a sequence of 3 billion chemical letters inscribed in a molecule called DNA – is over 2 meters long. Famously, short stretches of DNA fold into a double helix, which winds around histone proteins to form the 10nm fiber. But what about longer pieces? Does the genome’s fold influence function? How does the information contained in such an ultra-dense packing even remain accessible?
In this talk, I describe our work developing ‘Hi-C’ (Lieberman-Aiden et al., Science, 2009; Aiden, Science, 2011) and more recently ‘in-situ Hi-C’ (Rao & Huntley et al., Cell, 2014), which use proximity ligation to transform pairs of physically adjacent DNA loci into chimeric DNA sequences. Sequencing a library of such chimeras makes it possible to create genome-wide maps of physical contacts between pairs of loci, revealing features of genome folding in 3D.
Next, I will describe recent work using in situ Hi-C to construct haploid and diploid maps of nine cell types. The densest, in human lymphoblastoid cells, contains 4.9 billion contacts, achieving 1 kb resolution. We find that genomes are partitioned into contact domains (median length, 185 kb), which are associated with distinct patterns of histone marks and segregate into six subcompartments. We identify ∼10,000 loops. These loops frequently link promoters and enhancers, correlate with gene activation, and show conservation across cell types and species. Loop anchors typically occur at domain boundaries and bind the protein CTCF. The CTCF motifs at loop anchors occur predominantly (>90%) in a convergent orientation, with the asymmetric motifs “facing” one another.
Next, I will discuss the biophysical mechanism that underlies chromatin looping. Specifically, our data is consistent with the formation of loops by extrusion (Sanborn & Rao et al., PNAS, 2015). In fact, in many cases, the local structure of Hi-C maps may be predicted in silico based on patterns of CTCF binding and an extrusion-based model.
Finally, I will show that by modifying CTCF motifs using CRISPR, we can reliably add, move, and delete loops and domains. Thus, it is possible not only to “read” the genome’s 3D architecture, but also to write it.
Primer: Introduction to Hi-C
Abstract: Hi-C is an assay that measures the frequency with which any two loci in the genome are in physical contact in the nucleus of a cell. We will begin by reviewing the Hi-C experimental procedure and discussing data normalization techniques. Then we will describe some common techniques to analyze Hi-C data. We will conclude by discussing some modifications/alternatives to Hi-C.
Princeton Genomics, Flatiron Institute
From genome to networks: a data-driven, tissue-specific view of human disease
Abstract: Identifying functional effects of noncoding variants is a major challenge in human genetics. I will discuss our deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/) that predicts noncoding-variant effects de novo from genomic sequence. DeepSEA directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants and to predict tissue-specific expression based only on genomic sequence.
I will then discuss our work on building tissue-specific networks (http://hb.flatironinstitute.org/) to understand cell- and tissue-specific gene function and regulation and application of these networks to the study of autism spectrum disorder (ASD). ASD is a complex neurodevelopmental disorder with a strong genetic basis. Yet, only a small fraction of potentially causal genes—about 65 genes out of an estimated several hundred—are known with strong genetic evidence from sequencing studies. We developed a complementary machine-learning approach based on a human brain-specific gene network to present a genome-wide prediction of autism risk genes, including hundreds of candidates for which there is minimal or no prior genetic evidence.
All predictions and networks are available at http://asd.princeton.edu/.
Troyanskaya Lab, Princeton Genomics
Primer: Integrated, tissue-specific analysis of biological data
Abstract: The increasingly commonplace generation of genome-scale data provides us with a wealth of biological knowledge that captures global molecular-level changes in diverse model organisms and humans. However, these large data are often noisy, highly heterogenous, and lack the resolution required to study key aspects of metazoan complexity, such as tissue and cell-type specificity. In this primer, we will discuss a semi-supervised Bayesian network integration approach that leverages such large data compendia in concert with biological knowledge derived from small scale experiments to predict functional relationships between genes. We will then explore some of the applications of these models of tissue and cell function, including the prioritization of novel disease candidate genes based on genome-wide association studies (GWAS). Finally, we will demo publicly available web servers that provide interfaces to many of the analyses described here.
Cheng-Zhong Zhang, Rick Tourdot
HMS, Dana Farber
In search of lost time: reconstructing the evolutionary history of cancer genomes
Abstract: Chromosomal abnormalities are a hallmark feature of cancer genomes. In contrast to point mutations that mostly accumulate gradually during tumor development, chromosomal rearrangements are often generated episodically and manifest as clusters (or complex rearrangements). Such complexity makes it extremely difficult to identify patterns of chromosomal rearrangements or infer the history of rearrangement accumulation. In this talk, I will discuss our recent progress towards solving this problem based on three ideas. The first is to combine haplotype phasing and allelic copy-number analysis to determine the DNA copy number of each parental homolog. The second is to combine haplotype copy number and discordant read pairs to construct the sequence of rearranged chromosomes. Finally, we use knowledge from in vitro cell biology experiments to recognize unique rearrangement patterns. I will also discuss strategies to infer the timing of mutational events.
Spring 2017 Schedule: 8:30am Primer, 9:20am Breakfast, 9:30am Seminar, 10:30am Discussion; all in Monadnock
Department of Systems Biology, Harvard Medical School
Structure and fitness from genomic sequences
Abstract: The evolutionary trajectories of biological sequences are propelled by mutation and whittled away by selection to maintain and develop function. Present day sequences can therefore be regarded as the outcomes of millions of evolutionary experiments that record functional constraints in the genotype-phenotype map. In this talk I will first recap the primer by John and Adam that describes how a generative model for sequences can quantify evolutionary constraints on biomolecules in terms of couplings between specific residue combinations. I will show how we have applied this model to predict (i) accurate 3D structures of proteins, RNA and complexes, (ii) conformational plasticity of ‘disordered’ proteins, (iii) quantitative effects of mutations on organism fitness, and (iv) designed sequences of proteins with desired properties. These computational approaches address the challenge of inferring causality from correlations in genetic sequences but can be applied more widely to other biological information such as gene expression or dynamics, cellular phenotypes or drug response. I will introduce challenges and opportunities for extending these methods to diverse biomedical and engineering applications.
John Ingraham and Adam Riesselman
Marks Lab, Department of Systems Biology, Harvard Medical School
Primer: Generative models of biological sequence families
Abstract: Modern genome sequencing and synthesis can acquire and generate tremendous molecular diversity in a day, but our ability to navigate and interpret the exponentially large space of potential biological sequences remains limited. Central to this challenge is the lack of a priori knowledge about epistasis, i.e. non-additive interactions between positions in a molecule or genome. We will describe a class of generative models, discrete undirected graphical models, that, when fit to deep evolutionary sequence variation, can reveal both the three dimensional structures and mutational landscapes of proteins and RNAs, described in more detail in the talk by Debora after the break. In this primer, we will review the math and intuition behind these models, how they require approximate methods for scalable inference, and connections to other common methods in quantitative biology such as partial correlations and logistic regression. Lastly, we will outline how to go beyond pairwise and detect higher order epistasis with neural-network-powered generative models.
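The pairwise undirected model referred to above is commonly written as a Potts model. The form below is the standard one from the literature, stated here as an assumption about what the primer covers; the symbols (site-wise fields h_i, pairwise couplings J_ij, sequence length L) are conventional and not taken from the abstract.

```latex
% Probability of a sequence \sigma = (\sigma_1, \dots, \sigma_L)
P(\sigma) = \frac{1}{Z} \exp\!\Big( \sum_{i=1}^{L} h_i(\sigma_i)
          + \sum_{1 \le i < j \le L} J_{ij}(\sigma_i, \sigma_j) \Big)

% The partition function sums over all q^L sequences (q = alphabet size),
% which is why exact maximum likelihood is intractable
Z = \sum_{\sigma'} \exp\!\Big( \sum_{i} h_i(\sigma'_i)
          + \sum_{i < j} J_{ij}(\sigma'_i, \sigma'_j) \Big)

% Each site's conditional distribution avoids Z and has the form of a
% multinomial logistic regression on the other sites
P(\sigma_i = s \mid \sigma_{-i}) =
  \frac{\exp\!\big( h_i(s) + \sum_{j \ne i} J_{ij}(s, \sigma_j) \big)}
       {\sum_{s'} \exp\!\big( h_i(s') + \sum_{j \ne i} J_{ij}(s', \sigma_j) \big)}
```

Maximizing the product of these conditionals (pseudolikelihood) is one widely used approximate fitting method, and it makes the connection to logistic regression mentioned in the abstract explicit.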
U Penn School of Medicine
Gone Fishing: Unsupervised methods for discovery from public data
Abstract: Public gene expression data are abundant. Anybody with an internet connection can download more than 2 million genome-wide assays of gene expression. Learning from these data remains challenging. For example, public data often lack the annotations that enable traditional meta-analysis. If we could surmount these barriers, however, we'd have a valuable resource at our fingertips. Our lab uses machine learning methods to integrate these heterogeneous, noisy, and often poorly or incorrectly annotated data. We focus specifically on algorithms that are unsupervised and robust to noise in order to tackle unannotated data. We've shown that these algorithms can robustly reveal biological features in data from cancer biopsies to microbial systems. And we share these algorithms by building user-friendly software and web servers. Our aim is to make the reproducible analysis of big public data as routine in life sciences labs as wet-bench techniques like PCR.
Primer: Integrating biomedical knowledge to predict new uses for existing drugs
Abstract: How do you teach a computer biology? Our goal was to predict new uses for existing drugs. But we're data scientists, not pharmacologists. So we set out to encode the knowledge from millions of biomedical studies from the last half century. Using a heterogeneous network (hetnet) as our data structure, we were able to condense a large portion of biomedical knowledge into a network with 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. The network is named Hetionet v1.0 and lives at https://neo4j.het.io.
Hetionet enables queries that span many types of information. While such queries were possible before Hetionet, they often took months of data integration, preprocessing, and specialized query scripts. Now complex queries can be written in minutes using the Cypher query language for hetnets. Accordingly, we were able to perform ~47 million queries to assess the connectivity between 136 diseases and 1,538 compounds. Next, we compiled a catalog of 755 disease-modifying treatments and learned which types of network paths could predict whether a compound treats a disease. In total, we predicted probabilities of treatment for 209,168 compound-disease pairs (http://het.io/repurpose). Our method also allows you to compare which types of information were valuable for predicting drug efficacy. Project Rephetio, the codename for this project, was performed openly online in realtime (https://doi.org/bszr). In total, 40 community members provided feedback across 86 project discussions.
Attend the primer to learn more about Project Rephetio & Hetionet as well as hetnets for data integration and the Neo4j graph database. Research continues as a set of open source GitHub repositories, allowing anyone interested to get involved.
Harvard Chemistry and Chemical Biology
Deep learning chemical space: a variational autoencoder for automatic molecular design
Abstract: Virtual screening is increasingly proven as a tool to test new molecules for a given application. Through simulation and regression we can gauge whether a molecule will be a promising candidate in an automatic and robust way. A large remaining challenge, however, is how to perform optimizations over a discrete space of size at least 10^60. Despite the size of chemical space, or perhaps precisely because of it, coming up with novel, stable, makeable molecules that are effective is not trivial. First-principles approaches to generating new molecules fail to capture the intuition embedded in the ~100 million existing molecules. I will report our progress towards developing an autoencoder that allows us to project molecular space into a continuous, differentiable representation where we can perform molecular optimization.
Carl de Boer
Regev Lab, Broad Institute
Learning the rules of gene regulation with millions of synthetic promoters
Abstract: Gene regulatory programs are encoded in the sequence of the DNA. However, how the cell uses transcription factors (TFs) to interpret regulatory sequence remains incompletely known. Synthetic regulatory sequences can provide insight into this logic by providing additional examples of sequences and their regulatory output in a controlled setting. Here, we have measured the gene expression output of tens of millions of unique promoter sequences, whose expressions span a range of 1000-fold, in a controlled reporter construct. This vast dataset of expression-DNA pairs represents a unique machine learning opportunity, and we use it to build quantitative models of transcriptional regulation based on biochemical principles. Even with a naive “billboard” model of gene regulation (with no positioning or complex TF-interactions), we can explain upwards of 92% of the variation in expression. We gain numerous insights into gene regulation, including a quantitative description of activation, repression, and chromatin modification for each TF, consistent with known TF activities and condition-specific regulators, and even use our data to refine the specificities of TFs. Although a “billboard” model explains the majority of expression in our system, certain TFs show position-, orientation-, and even DNA helical-face-dependent activities. We have so many promoter examples that we can look for potential spacing/orientation-dependent interactions between most TF pairs at base pair resolution, and find certain interactions consistent with biochemical cooperativity. Altogether, the principles learned here help us to better understand when and where TFs bind DNA, what they do when they get there, and how regulatory sequences evolve.
Data Sciences & Data Engineering, Broad
A scalable Bayesian framework for inferring copy number variation
Abstract: Inferring copy number variation (CNV) from next-generation sequencing (NGS) data is a challenging problem. On the one hand, the complexity of the NGS technology results in a highly non-uniform sampling of the genome with unknown latent factors. On the other hand, devising and implementing modern machine learning algorithms for CNV inference in a scalable and robust fashion is an arduous task due to the sheer size of the data. In this talk, we briefly review the existing approaches and glance over a number of their caveats, including difficulty with sex chromosomes, lack of a data-driven model for determining the number of bias latent factors, neglect of sampling noise, heuristic filtering and outlier detection, lack of self-consistency and scalability. Next, we introduce GATK gCNV, our principled and scalable Bayesian framework for germline CNV inference from whole-exome sequencing (WES) and whole-genome sequencing (WGS) data that addresses these caveats. We benchmark GATK gCNV, XHMM and CODEX on WES data against high-confidence Genome STRiP calls on matched WGS data as ground truth, and show that GATK gCNV yields up to 30 percent higher sensitivity and specificity compared to the existing tools. We conclude the talk with a brief discussion of our ongoing efforts toward addressing the difficulty with common and large CNV events, and generalization to somatic CNV inference.
Data Sciences & Data Engineering, Broad
Primer: Bayesian PCA
Abstract: The model at the heart of GATK gCNV builds heavily on the probabilistic and Bayesian approaches to principal component analysis (PCA). In contrast with traditional PCA, the probabilistic approach provides a predictive model that can account for missing data, while the fully Bayesian approach further enables a principled way to learn the effective dimensionality of the principal subspace (i.e., the appropriate number of principal components to use). Model inference can be performed using expectation-maximization and variational-Bayesian methods, respectively. We will give a pedagogical overview of these methods, drawing analogies between Bayesian PCA and the perhaps more familiar Gaussian mixture model.
MIT IDSS, CSAIL, EECS
Probabilistic models of diversity: applications and algorithms for determinantal point processes
Abstract: Determinantal Point Processes (DPPs) are gaining popularity in machine learning as elegant probabilistic models of diversity. In other words, these are probability distributions over subsets of a collection of items (data points, features, ...) that prefer diverse subsets. In particular, many computations that are difficult with other models "simply" reduce to linear algebra for DPPs. DPPs have been known to arise in statistical physics, combinatorial probability and random matrix theory, and certain approximation algorithms. The first part of this talk will survey machine learning-related applications of DPPs, from recommendation, feature selection and improving interpretability to matrix approximations for kernel methods and pruning of neural networks.
Despite their ease of modeling, the wide applicability of DPPs has been hindered by computationally expensive sampling algorithms. The second part of the talk will address recent progress in sampling algorithms for DPPs and its implications in theory and practice. Most of the talk will be tutorial-style and does not require any prior knowledge of DPPs.
Based on joint work with Chengtao Li and Suvrit Sra.
MIT IDSS, CSAIL, EECS
Primer: A primer on determinantal point processes
Abstract: The primer will be a short tutorial that introduces Determinantal Point Processes with a bit of detail and intuition, and explains their relation to diversity, basic computations, and important models.
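To make the link between determinants and diversity concrete, here is a minimal sketch of an L-ensemble, one standard way of writing a DPP: a subset S has probability det(L_S) / det(L + I). The 3-item kernel is made up for illustration, and the brute-force enumeration only works for tiny item sets.

```python
from itertools import combinations


def det(matrix):
    """Determinant by cofactor expansion (adequate for tiny matrices)."""
    n = len(matrix)
    if n == 0:
        return 1.0  # determinant of the empty matrix
    if n == 1:
        return matrix[0][0]
    total = 0.0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * det(minor)
    return total


def submatrix(matrix, items):
    """Rows and columns of matrix restricted to the given item indices."""
    return [[matrix[i][j] for j in items] for i in items]


def dpp_probabilities(L):
    """Probability of every subset under the L-ensemble with kernel L."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    normaliser = det(L_plus_I)
    return {
        subset: det(submatrix(L, subset)) / normaliser
        for size in range(n + 1)
        for subset in combinations(range(n), size)
    }


# Hypothetical kernel: items 0 and 1 are similar, item 2 is unlike both.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
probs = dpp_probabilities(L)
print("similar pair   :", probs[(0, 1)])
print("dissimilar pair:", probs[(0, 2)])
```

The similar pair gets a much smaller probability than the dissimilar pair, which is the sense in which the model prefers diverse subsets.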
Julien de Wit, Nicolas Wieder
MIT Earth And Planetary Sciences, Greka Lab
A pseudo-random walk from new worlds to diabetes
Abstract: For centuries, our understanding of planetary systems has been based on observations of a unique sample, the Solar System. Similarly, our perspective on Life and habitats has remained Earth-centric, leaving millennia-old questions such as "Are we alone? Where/How/When did Life emerge?" unanswered. Two decades ago, the first planet orbiting a star other than ours (a.k.a. an exoplanet) was discovered, opening a new chapter of space exploration. Since then, over 3,500 exoplanets have been found in over 2,500 other systems; a sample size increase of three orders of magnitude that has already yielded profound changes in our understanding of planetary systems. Similar changes await our perspective on Life and habitats within the next generation. During this talk, a “Searching for New Worlds 101” will be provided to introduce the TRAPPIST-1 system, exploring our recent discovery of Earth-sized planets that are both potentially habitable and amenable to in-depth studies with upcoming observatories, and the first insights into their atmospheres, as revealed by the Hubble Space Telescope.
At the other end of the scale, biology focuses on chemical processes within cells rather than within atmospheres. A fundamental—and yet mostly overlooked—set of cellular processes gravitates around transient calcium signals. The availability of fast fluorescent calcium indicators allows for the measurements of intracellular calcium and thus provides direct observables of pathological and physiological calcium fluctuations. Calcium signals thereby offer new perspectives to approach a variety of diseases, from diabetes and metabolic disease to Alzheimer's disease.
Interestingly, these seemingly diverse fields of biology and planetary sciences share a common cornerstone: (Spectro)Photometric time series. With the arrival of high throughput facilities (e.g. TESS for exoplanetary sciences; FLIPR for biology), the need for standardized data acquisition/processing tools has emerged. The inherent similarity between these fields, in terms of multidisciplinarity and datatype, allows for mutually-beneficial collaborations that need to be leveraged to support the optimal sampling of yet unexplored parameter spaces, and their unbiased interpretation.
Lander Lab, Broad Institute
Grand Challenge: Mapping the regulatory wiring of the genome
Abstract: Our cells are controlled by complex molecular instructions encoded in the "noncoding" sequences of our genome, and alterations to these noncoding sequences underlie many common human diseases. The grammar of these noncoding sequences has been difficult to study, but the recent confluence of methods for both high-throughput measurement and high-throughput perturbation offers new opportunities to understand these sequences at a systems level. In this talk, I will highlight outstanding challenges in gene regulation where applying computational approaches in combination with emerging genomics datasets may allow us to build integrated maps that describe the regulatory wiring of the genome. As an example, I will present our efforts to experimentally and computationally map the functional connections between promoters and distal enhancers and use this information to understand human genetic variation in the noncoding genome.
Composing graphical models with neural networks for structured representations and fast inference
Abstract: I'll describe a new modeling and inference framework that combines the flexibility of deep learning with the structured representations of probabilistic graphical models. The model family augments latent graphical model structure, like switching linear dynamical systems, with neural network observation likelihoods. To enable fast inference, we show how to leverage graph-structured approximating distributions and, building on variational autoencoders, fit recognition networks that learn to approximate difficult graph potentials with conjugate ones. I'll show how these methods can be applied to learn how to parse mouse behavior from depth video.
Columbia, Blei Lab
Primer: Bayesian time series modeling with recurrent switching linear dynamical systems
Abstract: Many natural systems like neurons firing in the brain or basketball teams traversing a court give rise to time series data with complex, nonlinear dynamics. We gain insight into these systems by decomposing the data into segments that are each explained by simpler dynamical units. Bayesian time series models provide a flexible framework for accomplishing this task. This primer will start with the basics, introducing linear dynamical systems and their switching variants. With this background in place, I will introduce a new model class called recurrent switching linear dynamical systems (rSLDS), which discover distinct dynamical units as well as the input- and state-dependent manner in which units transition from one to another. In practice, this leads to models that generate much more realistic data than standard SLDS. Our key innovation is to design these recurrent SLDS models to enable recent Pólya-gamma auxiliary variable techniques and thus make approximate Bayesian learning and inference in these models easy, fast, and scalable.
Wellcome Trust Centre for Human Genetics, Oxford
Simulating, storing and processing genetic variation data for millions of samples
Abstract: Coalescent theory has played a key role in modern population genetics and is fundamental to our understanding of genetic variation. While simulation has been essential to coalescent theory from its beginnings, simulating realistic population-scale genome-wide data sets under the exact model was, until recently, considered infeasible. Even under an approximate model, simulating more than a few tens of thousands of samples was very time consuming and could take several weeks to complete a single replicate. However, by encoding simulated genealogies using a new data structure (called a tree sequence), we can now simulate entire chromosomes for millions of samples under the exact coalescent model in a few hours. We discuss some applications that these simulations have made possible, including a study of biases in human GWAS and the systematic benchmarking of variant processing tools at scale. The tree sequence data structure is also an extremely concise way of representing genetic variation data, and we show how variant data for millions of simulated human samples can be stored in only a few gigabytes. Moreover, we show that this very high level of compression does not incur a decompression cost. Because the information is represented in terms of the underlying genealogies, operations such as computing allele frequencies on sample subsets or measuring linkage disequilibrium can be made very efficient. Finally, we discuss ongoing work on inferring tree sequences from observed data and present some preliminary results.
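For readers new to the coalescent, the sketch below simulates only its most basic ingredient: with k lineages remaining, the waiting time to the next merger is exponential with rate k(k-1)/2, in coalescent time units. It does not implement the tree sequence data structure or recombination; the function names are illustrative.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Time to the most recent common ancestor under the standard coalescent.

    Time is in coalescent units; with k lineages left, the next merger
    arrives after an Exponential(k choose 2) waiting time.
    """
    time = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could merge
        time += rng.expovariate(rate)
    return time


def mean_tmrca(n_samples, n_replicates, seed=1):
    """Average simulated TMRCA over independent replicates."""
    rng = random.Random(seed)
    total = sum(simulate_tmrca(n_samples, rng) for _ in range(n_replicates))
    return total / n_replicates


n = 10
estimate = mean_tmrca(n, n_replicates=20000)
expected = 2 * (1 - 1 / n)  # known expectation of the TMRCA for n samples
print(f"simulated mean: {estimate:.3f}   theory: {expected:.3f}")
```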
Department of Biomedical Informatics, Harvard Medical School
From one to millions of cells: computational challenges in single-cell analysis
Abstract: Over the last five years, our ability to isolate and analyze detailed molecular features of individual cells has expanded greatly. In particular, the number of cells measured by single-cell RNA-seq (scRNA-seq) experiments has gone from dozens to over a million cells, thanks to improved protocols and fluidic handling. Analysis of such data can provide detailed information on the composition of heterogeneous biological samples, and the variety of cellular processes that altogether comprise the cellular state. Such inferences, however, require careful statistical treatment, to take into account measurement noise as well as inherent biological stochasticity. I will discuss several approaches we have developed to address such problems, including error modeling techniques, statistical interrogation of heterogeneity using gene sets, and visualization of complex heterogeneity patterns, implemented in the PAGODA package. I will discuss how these approaches have been modified to enable fast analysis of very large datasets in PAGODA2, and how the flow of typical scRNA-seq analysis can be adapted to take advantage of potentially extensive repositories of scRNA-seq measurements. Finally, I will illustrate how such approaches can be used to study transcriptional and epigenetic heterogeneity in human brains.
Harvard Medical School, Kharchenko Lab
Primer: Linking genetic and transcriptional intratumoral heterogeneity at the single cell level
Broad Fellow, Chemical Biology & Therapeutic Sciences
Continuous directed evolution: advances, applications, and opportunities
Abstract: The development and application of methods for the laboratory evolution of biomolecules has rapidly progressed over the last few decades. Advancements in continuous microbe culturing and selection design have facilitated the development of new technologies that enable the continuous directed evolution of proteins and nucleic acids. These technologies have the potential to support the extremely rapid evolution of biomolecules with tailor-made functional properties. Continuous evolution methods must support all of the key steps of laboratory evolution — translation of genes into gene products, selection or screening, replication of genes encoding the most fit gene products, and mutation of surviving genes — in a self-sustaining manner that requires little or no researcher intervention. In this presentation, I will describe the basis and applications of our Phage-Assisted Continuous Evolution (PACE) platform, solutions we have devised to address known limitations in the technique, and opportunities to improve PACE where in silico computation may play a key role. Through these tools, we aspire to enable researchers to address increasingly complex biological questions and to access biomolecules with novel or even unprecedented properties.
MIT EECS, CSAIL, and IDSS
Edge-exchangeable graphs, clustering, and sparsity
Abstract: Many popular network models rely on the assumption of (vertex) exchangeability, in which the distribution of the graph is invariant to relabelings of the vertices. However, the Aldous-Hoover theorem guarantees that these graphs are dense or empty with probability one, whereas many real-world graphs are sparse. We present an alternative notion of exchangeability for random graphs, which we call edge exchangeability, in which the distribution of a graph sequence is invariant to the order of the edges. We demonstrate that a wide range of edge-exchangeable models, unlike any models that are traditionally vertex-exchangeable, can exhibit sparsity. To develop characterization theorems for edge-exchangeable graphs analogous to the powerful Aldous-Hoover theorem for vertex-exchangeable graphs, we turn to a seemingly different combinatorial problem: clustering. Clustering involves placing entities into mutually exclusive categories. A "feature allocation" relaxes the requirement of mutual exclusivity and allows entities to belong simultaneously to multiple categories. In the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via "exchangeable partition probability functions" and the "Kingman paintbox"). These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We show how these characterizations can be extended to feature allocations and, from there, to edge-exchangeable graphs.
Primer: Nonparametric Bayesian models, methods, and applications
Abstract: Nonparametric Bayesian methods make use of infinite-dimensional mathematical structures to allow the practitioner to learn more from their data as the size of their data set grows. What does that mean, and how does it work in practice? In this tutorial, we'll cover why machine learning and statistics need more than just parametric Bayesian inference. We'll introduce such foundational nonparametric Bayesian models as the Dirichlet process and Chinese restaurant process and touch on the wide variety of models available in nonparametric Bayes. Along the way, we'll see what exactly nonparametric Bayesian methods are and what they accomplish.
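A small simulation of the Chinese restaurant process shows what "learning more as the data set grows" can mean: the number of clusters is not fixed in advance but tends to increase slowly with the number of data points. This is a minimal sketch; the concentration value `alpha=1.0` and the function name are choices made here for illustration.

```python
import random


def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one at a time and return the list of table sizes.

    With n already seated, the next customer joins table k with probability
    size_k / (n + alpha) and opens a new table with probability
    alpha / (n + alpha).
    """
    table_sizes = []
    for n_seated in range(n_customers):
        threshold = rng.random() * (n_seated + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if threshold < cumulative:
                table_sizes[k] += 1
                break
        else:
            # threshold fell in the final alpha-sized slice: open a new table
            table_sizes.append(1)
    return table_sizes


rng = random.Random(0)
for n in (10, 100, 1000):
    tables = chinese_restaurant_process(n, alpha=1.0, rng=rng)
    print(f"{n:5d} customers -> {len(tables)} tables")
```

Each table plays the role of a cluster, so the output gives a feel for how the number of clusters scales with sample size under this prior.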
Helmholtz Zentrum München, TU Munich
Reconstructing trajectories and branching lineages in single cell genomics
Abstract: Single-cell technologies have gained popularity in developmental biology because they allow resolving potential heterogeneities due to asynchronicity of differentiating cells. Common data analysis encompasses normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we may not expect clear clusters to be present - instead cells tend to follow continuous branching lineages.
In this talk I will first review methods for pseudotime ordering of cells according to their single cell profiles, which are used for reconstructing such trajectories. Then I will show that modeling the high-dimensional state space as a diffusion process, where cells move to close-by cells with a distance-dependent probability, reflects the differentiating characteristics well. Based on the underlying diffusion map transition kernel, cells can be ordered according to a diffusion pseudotime (DPT), which allows for a robust identification of branching decisions and corresponding trajectories of single cells. After application to blood stem cell differentiation, I finish with current extensions towards single cell RNAseq time series and population models as well as driver-gene identification.
Center for Science and the Imagination, Arizona State University
What Algorithms Want
Abstract: We depend on — we believe in — algorithms to help us get a ride, choose which book to buy, execute a mathematical proof. It is as if we think of code as a magic spell, an incantation to reveal what we need to know and even what we want. But how do we navigate the gap between what algorithms really do and all the things we think, and hope, they do? This talk explores the evolving figure of the algorithm as it bridges the idealized space of computation and messy reality, with unpredictable and sometimes fascinating results. Drawing on sources that range from Neal Stephenson’s “Snow Crash” to Diderot’s “Encyclopédie,” from Adam Smith to the “Star Trek” computer, Finn explores the gap between theoretical ideas and pragmatic instructions, and the consequences of that gap for research at the intersection of computation and culture.
Fall 2016 Schedule: 8:30am Primer, 9:20am Breakfast, 9:30am Seminar, 10:30am Discussion; all in Monadnock
Regev and Lander Labs
Broad Institute and MIT CSBi
Composite measurements and molecular compressed sensing for efficient transcriptomics at scale
Abstract: Comprehensive RNA profiling provides an excellent phenotype of cellular responses and tissue states, but can be prohibitively expensive to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. However, because expression profiles may reflect a limited number of degrees of freedom, a smaller number of measurements might suffice to capture most of the information. Here, we use existing mathematical guarantees to demonstrate that gene expression information can be preserved in a random low dimensional space. We propose that samples can be directly observed in low dimension through a fundamentally new type of measurement that distributes a single readout across many genes. We show by simulation that as few as 100 of these randomly composed measurements are needed to accurately estimate the global similarity between any pair of samples. Furthermore, we show that methods of compressive sensing can be used to recover gene abundances from drastically under-sampled measurements, even in the absence of any prior knowledge of gene-to-gene correlations. Finally, we propose an experimental scheme for such composite measurements. Thus, compressive sensing and composite measurements can become the basis of a massive scale up in the number of samples that can be profiled, opening new opportunities in the study of single cells, complex tissues, perturbation screens and expression-based diagnostics.
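The claim that about 100 random composite measurements preserve the similarity between samples can be checked on synthetic data. The sketch below is an assumption-laden toy: the expression profiles are random numbers, and the composite weights are Gaussian, which is one standard choice for random projections and not necessarily the measurement design proposed in the talk.

```python
import math
import random


def composite_measurements(expression, weights):
    """Each measurement is one weighted sum over all genes."""
    return [sum(w * x for w, x in zip(row, expression)) for row in weights]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)
n_genes, n_measurements = 2000, 100

# Hypothetical expression profiles for two samples.
sample_a = [rng.expovariate(1.0) for _ in range(n_genes)]
sample_b = [rng.expovariate(1.0) for _ in range(n_genes)]

# Random Gaussian weights, scaled so that squared distances are preserved
# in expectation.
scale = 1 / math.sqrt(n_measurements)
weights = [[rng.gauss(0, 1) * scale for _ in range(n_genes)]
           for _ in range(n_measurements)]

low_a = composite_measurements(sample_a, weights)
low_b = composite_measurements(sample_b, weights)

ratio = euclidean(low_a, low_b) / euclidean(sample_a, sample_b)
print(f"distance in 100 dims / distance in 2000 dims = {ratio:.3f}")
```

A ratio near 1 means the pairwise distance survived the 20-fold reduction in the number of measurements; recovering individual gene abundances is a separate step that requires compressed sensing.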
Automated Inference and the Promise of Probabilistic Programming
Abstract: Generative probability models allow us to 1) express assumptions about hidden patterns in data, 2) infer such hidden patterns, and 3) evaluate the accuracy of our findings.
However, designing modern models, developing custom inference algorithms, and evaluating accuracy requires enormous effort and cross-disciplinary expertise. Probabilistic programming promises to enable this process by making each step less arduous and more automated.
I will begin by describing how probabilistic programming can help design modern probability models. I will then focus on automating inference for a wide class of probability models. To this end, I will describe automatic differentiation variational inference, a fully automated approximate inference algorithm. I will demonstrate its application to a mixture modeling analysis of a dataset with millions of observations. I intend to conclude with some thoughts on model evaluation, with a population genetics example.
Throughout this talk, I will highlight connections to our software project, Edward: a Python library for probabilistic modeling, inference, and evaluation.
Primer: Probabilistic Generative Models and Posterior Inference
Abstract: To model data we desire to express assumptions about the data, infer hidden structure, make predictions, and simulate new data. In this talk, I will describe how probabilistic generative models provide a common toolkit to meet these challenges. I will first present these ideas in a toy setting and then discuss the range of probabilistic generative models from structural to algorithmic. Next I will present an in-depth view of deep exponential families, a class of probability models containing both predictive and interpretive models. I will end with the central computational problem in realizing the promise of probabilistic generative models: posterior inference. I will demonstrate why deriving inference is tedious and will touch on black box variational methods which seek to alleviate this burden.
Dana-Farber Cancer Institute
Harvard School of Public Health
Overcoming Bias and Batch Effects in High-Throughput Data
Abstract: The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. In the life sciences, data analysis is now part of practically every research project. Genomics, in particular, is being driven by new measurement technologies that permit us to observe certain molecular entities for the first time. These observations are leading to discoveries analogous to identifying microorganisms and other breakthroughs permitted by the invention of the microscope. Examples of this are the many applications of next-generation sequencing.
Biases, systematic errors and unexpected variability are common in biological data. Failure to discover these problems often leads to flawed analyses and false discoveries. As datasets become larger, the potential of these biases to appear to be significant actually increases. In this talk I will describe several of these challenges using very specific examples from gene expression microarrays, RNA-seq, and single-cell assays. I will describe data science solutions to these problems.
Harvard Sys Bio, HST
Primer: Experimental and computational techniques underlying RNA-seq
Abstract: We will provide an overview of the experimental and computational steps involved in RNA-seq for both bulk and single-cell experiments. We will begin with a brief review of Illumina short-read sequencing by synthesis; continue by describing the molecular biology used in preparing RNA-seq libraries; and discuss quality trimming, read alignment, transcript quantification and normalization of gene expression measures. We will conclude with a discussion of techniques commonly leveraged in single-cell RNA-seq: linear pre-amplification, unique molecular identifiers (UMI/RMTs) and 3’-barcode counting. Throughout the primer, we will mention potential sources of bias that can be introduced at each step and why they occur.
Warren Center for Network and Data Sciences
Complex Systems Group & Department of Mathematics
University of Pennsylvania
What can persistent homology see?
Abstract: The usual framework for TDA takes as its starting point that a data set is sampled (noisily) from a manifold embedded in a high dimensional space, and provides a reconstruction of topological features of that manifold. However, the underlying algebraic topology can be applied to data in a much broader sense, carries much richer information about the system than just the barcodes, and can be fine-tuned so it sees only features of the data we want it to see. I will discuss this framework broadly, with a focus on a few of these alternative viewpoints, including applications to neuroscience and matrix factorization.
Functional Cancer Genomics
Broad Institute of MIT and Harvard
Primer: What is persistent homology?
Abstract: A fundamental question in big data analysis is whether and how the data points may have been sampled, noisily, from an intrinsically low-dimensional geometric shape, called a manifold, embedded in a high-dimensional “sensor” space. Topological data analysis (TDA) aims to measure the “intrinsic shape” of data and identify this manifold despite noise and the likely nonlinear embedding. I will discuss the basics of the fundamental tool in TDA called persistent homology, which assigns to a point cloud a count of topological features (roughly, “holes” of various dimensions) with a measure of importance of each feature recorded in a “barcode” of the data to help distinguish the significant features from the noise.
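The simplest case of a barcode, dimension 0, tracks connected components: every point starts as its own component at scale 0, and a component "dies" when growing the distance threshold merges it into another. The sketch below computes those death scales with a union-find over sorted pairwise distances. It covers only connected components, not higher-dimensional holes, and the point cloud is made up.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Death scales of connected components as the distance threshold grows.

    All components are born at scale 0. One component never dies (the
    infinite bar), so n points yield n - 1 finite death values.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one bar ends
            deaths.append(length)
    return deaths


# Hypothetical point cloud: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
deaths = h0_barcode(points)
print([round(d, 3) for d in deaths])
```

Short bars reflect points merging within a cluster (noise-scale structure), while the single long finite bar records the scale at which the two clusters finally join.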
Harvard Medical School
FIDDLE: An integrative deep learning framework for functional genomic data inference
Abstract: Numerous advances in sequencing technologies have revolutionized genomics through generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but strategies to integrate disparate datasets under a unified framework are lacking. Moreover, most analysis techniques heavily rely on feature selection and data preprocessing, which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning), an open source, data-agnostic, flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that a type of data can be inferred from other sources of data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus, FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets.
Primer: Automatic differentiation, the algorithm behind all deep neural networks
Abstract: A painful and error-prone step of working with gradient-based models (deep neural networks being one kind) is actually deriving the gradient updates. Deep learning frameworks, like Torch, TensorFlow and Theano, have made this a great deal easier for a limited set of models — these frameworks save the user from doing any significant calculus by instead forcing the framework developers to do all of it. However, if a user wants to experiment with a new model type, or change some small detail the developers hadn’t planned, they are back to deriving gradients by hand. Fortunately, a 30+ year old idea, called “automatic differentiation”, and a one year old machine learning-oriented implementation of it, called “autograd”, can bring true and lasting peace to the hearts of model builders. With autograd, building and training even extremely exotic neural networks becomes as easy as describing the architecture. We will also address two practical questions — "What's the difference between all these deep learning libraries?" and "What does this all mean to me, as a biologist?" — as well as providing some detail and historical perspective on the topic of automatic differentiation.
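To give a flavour of the idea, here is a toy reverse-mode automatic differentiation engine for scalars. It is a from-scratch sketch, not the API of autograd or of any deep learning framework: the `Var` class, `tanh`, and `backward` are names invented here. Each operation records its inputs and local derivatives, and a single backward sweep applies the chain rule.

```python
import math


class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def tanh(x):
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Accumulate d(output)/d(node) into .grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # children first, then their parents
        for parent, local in node.parents:
            parent.grad += local * node.grad


# A one-neuron "network": y = tanh(w * x + b)
w, x, b = Var(0.5), Var(2.0), Var(-0.3)
y = tanh(w * x + b)
backward(y)
print("dy/dw =", w.grad, " dy/dx =", x.grad, " dy/db =", b.grad)
```

No gradient formula for the whole expression was written by hand; only the local derivative of each primitive operation was supplied.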
Dept. of Genetics
Dept. of Computer Science
Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics
Abstract: We present generalizable and interpretable supervised deep learning frameworks to predict regulatory and epigenetic state of putative functional genomic elements by integrating raw DNA sequence with diverse chromatin assays such as ATAC-seq, DNase-seq or MNase-seq. First, we develop novel multi-channel, multi-modal CNNs that integrate DNA sequence and chromatin accessibility profiles (DNase-seq or ATAC-seq) to predict in-vivo binding sites of a diverse set of transcription factors (TF) across cell types with high accuracy. Our integrative models provide significant improvements over other state-of-the-art methods including recently published deep learning TF binding models. Next, we train multi-task, multi-modal deep CNNs to simultaneously predict multiple histone modifications and combinatorial chromatin state at regulatory elements by integrating DNA sequence, RNA-seq and ATAC-seq or a combination of DNase-seq and MNase-seq. Our models achieve high prediction accuracy even across cell types, revealing a fundamental predictive relationship between chromatin architecture and histone modifications. Finally, we develop DeepLIFT (Deep Linear Importance Feature Tracker), a novel interpretation engine for extracting predictive and biologically meaningful patterns from deep neural networks (DNNs) for diverse genomic data types. DeepLIFT is the first method that can integrate the combined effects of multiple cooperating filters and compute importance scores accounting for redundant patterns. We apply DeepLIFT on our models to obtain unified TF sequence affinity models, infer high resolution point binding events of TFs, dissect regulatory sequence grammars involving homodimeric and heterodimeric binding with co-factors, learn predictive chromatin architectural features and unravel the sequence and architectural heterogeneity of regulatory elements.
University of Toronto
Algorithms for reconstructing tumor evolution
Abstract: Tumors contain genetically heterogeneous cancerous subpopulations that can differ in their metastatic potential and response to treatment. Our work over the past few years has focused on using computational and statistical methods to reconstruct the phylogeny and the full genotypes of these subpopulations using data from high-throughput sequencing of tumor samples.
Tumor subpopulations can be partially characterised by identifying tumor-associated somatic variants using short read sequencing. Subsequent inference of copy number variants or clustering of the variant allele frequencies (VAFs) can reveal the number of major subpopulations present in the tumor as well as the set of mutations which first appear in each subpopulation. Further analysis, and often different data, is needed to determine how the subpopulations relate to one another and whether they share any mutations. Ideally, this analysis would reconstruct the full genotypes of each subpopulation.
I will describe my lab’s efforts to recover these full genotypes by reconstructing the tumor’s evolutionary history. We do this by fitting subpopulation phylogenies to the VAFs. In some circumstances, a full reconstruction is possible but often multiple phylogenies are consistent with the data. We have developed a number of methods (PhyloSub, PhyloWGS, treeCRP, PhyloSpan) that use Bayesian inference in non-parametric models to distinguish ambiguous and unambiguous portions of the phylogeny thereby explicitly representing reconstruction uncertainty. Our methods consider both single nucleotide variants as well as copy number variations and adapt to data on pairs of mutations.
Data Sciences & Data Engineering
Primer: Intro to Dirichlet Processes
Abstract: At a mundane level, Dirichlet processes are a clustering algorithm that determines the number of clusters. However, they are also a way to do Bayesian inference on a single infinite model rather than ad hoc model selection on a series of finite models and are the gateway to the field of Bayesian non-parametric models. Many introductions to Dirichlet processes take a formal measure-theoretic approach. In contrast, if you can understand the multinomial distribution you will understand this primer.
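As a rough illustration of how a Dirichlet process can "determine the number of clusters", the sketch below samples cluster labels from the Chinese restaurant process, one standard constructive view of the Dirichlet process. It is not taken from the primer itself; the function name, the concentration value `alpha=1.0`, and the sample size are assumptions chosen for the example.

```python
import random
from collections import Counter

def chinese_restaurant_process(n_points, alpha, seed=0):
    """Sample cluster labels for n_points items from a CRP with concentration alpha."""
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for i in range(n_points):
        # Item i joins existing cluster k with probability counts[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # a new cluster is created
        else:
            counts[k] += 1     # an existing cluster grows
        labels.append(k)
    return labels

# Hypothetical settings, chosen only for illustration.
labels = chinese_restaurant_process(200, alpha=1.0)
cluster_sizes = Counter(labels)
print(len(cluster_sizes), "clusters; sizes:", sorted(cluster_sizes.values(), reverse=True))
```

The number of clusters is not fixed in advance: it grows with the data, and larger values of `alpha` tend to produce more clusters.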
Deep learning for computational pathology
Abstract: In this talk, we will provide an introduction to computational pathology, an emerging discipline at the intersection of pathology and computer engineering. We will also introduce a deep learning-based automatic whole slide image analysis system for the identification of cancer metastases in breast sentinel lymph nodes. Our system won first place in the Camelyon16 international challenge, which was held at the International Symposium on Biomedical Imaging (ISBI) 2016. The system achieved an area under the receiver operating characteristic curve (AUC) of 0.925 for the task of whole slide image classification and an average sensitivity of 0.705 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. Combining the predictions of the human pathologist and the automatic analysis system yields even higher performance. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.
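For readers unfamiliar with the AUC metric quoted above, here is a minimal sketch of how it can be computed: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The scores and labels below are made up for illustration and have nothing to do with the Camelyon16 data.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Count, over all positive/negative pairs, how often the positive ranks higher.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores for six slides (1 = metastasis, 0 = normal).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))  # 8 of the 9 positive/negative pairs are ranked correctly
```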
Babak Ehteshami Bejnordi
Beck Lab, Harvard Medical School at Beth Israel Deaconess Medical Center
Primer: Practical recommendations for training convolutional neural nets
Abstract: Deep learning, and in particular the convolutional neural network (ConvNet), is rapidly emerging as one of the most successful approaches for image and speech recognition. What distinguishes ConvNets and other deep learning systems from conventional machine learning techniques is their ability to learn the entire perception process from end to end. Deep learning systems use multiple nonlinear processing layers to learn useful representations of features directly from data.
Searching the parameter space of deep architectures is a complex optimization task. ConvNets can be very sensitive to their hyper-parameter settings and network architecture. In this talk, I will give practical recommendations for training ConvNets and discuss the motivation and principles behind them. I will also provide recommendations on how to tackle various problems in analyzing medical image data, such as lack of data and highly skewed class distributions.
Finally, I will introduce some of the advanced ConvNet architectures used in medical image analysis and their suitability for various tasks such as detection, classification, and segmentation.
Spectral unmixing for next-generation mass spectrometry proteomics
Abstract: Mass spectrometry proteomics is the method of choice for large-scale quantitation of proteins in biological samples, allowing rapid measurement of the concentrations of thousands of proteins in various modified forms. However, this technique still faces fundamental challenges in terms of reproducibility, bias, and comprehensiveness of proteome coverage. Next-generation mass spectrometry, also known as data-independent acquisition, is a promising new approach with the potential to measure the proteome in a far more comprehensive and reproducible fashion than existing methods, but it has lacked a computational framework suited to the highly convoluted spectra it inherently produces. I will discuss Specter, an algorithm that employs linear unmixing to disambiguate the signals of individual proteins and peptides in next-generation mass spectra. In addition to describing the linear algebra underlying Specter, we'll discuss its implementation in Spark with Python, and see several real datasets to which it's been applied.
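The idea of linear unmixing can be sketched in a few lines: model the observed spectrum as a non-negative combination of known reference spectra and solve for the coefficients. The toy below uses projected gradient descent on a least-squares objective. It is an illustrative assumption, not Specter's actual solver or its Spark implementation, and the three-peptide library and intensities are invented.

```python
def unmix(library, observed, n_iter=2000):
    """Non-negative least squares by projected gradient descent.

    library: list of reference spectra (one list of intensities per analyte).
    observed: mixed spectrum, same length as each reference spectrum.
    Returns one non-negative coefficient per reference spectrum.
    """
    n = len(library)
    # Step size 1 / ||A||_F^2 is a conservative choice that guarantees convergence.
    step = 1.0 / sum(v * v for spectrum in library for v in spectrum)
    coeffs = [0.0] * n
    for _ in range(n_iter):
        # Residual of the current reconstruction at every m/z bin.
        residual = [
            sum(coeffs[j] * library[j][i] for j in range(n)) - observed[i]
            for i in range(len(observed))
        ]
        # Gradient step, then projection onto the non-negative orthant.
        coeffs = [
            max(0.0, coeffs[j] - step * sum(r * a for r, a in zip(residual, library[j])))
            for j in range(n)
        ]
    return coeffs

# Hypothetical library of three peptide spectra over five m/z bins.
library = [
    [1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.2, 0.0],
    [0.0, 0.0, 0.3, 1.0, 0.6],
]
# Observed spectrum built as 2.0 * peptide 0 + 1.5 * peptide 2 (peptide 1 absent).
observed = [2.0, 1.0, 0.45, 1.5, 0.9]
abundances = unmix(library, observed)
print([round(a, 4) for a in abundances])
```

The non-negativity constraint matters because abundances cannot be negative; an unconstrained least-squares fit to noisy, overlapping spectra could otherwise assign negative weight to an absent peptide.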
Primer: Mass spectrometry-based proteomics
Abstract: Mass spectrometry is the workhorse technology to study the abundance and composition of proteins, the key players in every living cell. Within the last decade the technology experienced a revolution in terms of novel instrumentation and optimized sample handling protocols resulting in ever growing numbers of proteins and post-translational modifications that can be routinely studied on a system-wide scale. Briefly, proteins are extracted from cells or tissues and fragmented into smaller peptides. This extremely complex peptide mixture is subjected to liquid chromatography separation and subsequent tandem mass spectrometry analysis in which mass-to-charge ratios of intact peptides and peptide fragments are recorded. Resulting mass spectra are matched to sequence databases or spectral libraries to read out the amino acid sequences and thereby identify the corresponding proteins.
The technology is fundamentally different from sequencing-based genomics technology and faces different problems, such as the tremendous dynamic range of protein expression. The instruments can be operated in different acquisition modes for different applications. I will briefly introduce the basics behind discovery or ‘shotgun’ proteomics, targeted proteomics, data dependent acquisition and data independent acquisition; the latter is a recent and promising development in the proteomics community but poses novel and only partly solved challenges in data analysis. Ryan Peckner will talk about Specter, an approach that tackles this problem using linear algebra.
Compiling probabilistic programs
Abstract: Deriving and implementing an inference algorithm for a probabilistic model can be a difficult and error-prone task. Alternatively, in probabilistic programming, a compiler is used to transform a model into an inference algorithm. In this talk, we'll present probabilistic programming from the perspective of a compiler writer. A compiler for a traditional language uses intermediate languages (ILs) and static analysis to generate efficient code. We'll highlight how these ideas can be used in probabilistic programming for generating flexible and scalable inference algorithms.
Hail Team, Neale Lab
Primer: What is a compiler?
Abstract: A compiler is an algorithm that transforms a source language into a target language. The transformation typically includes an optimizing pass which reduces memory or time requirements. Classic compilers transform languages such as C or Java into near-machine code such as x86 Assembly or JVM Bytecode. Recent work on Domain Specific Languages (DSLs) expands the notion of "source language" in order to enable everyone to build easy-to-reason-about abstractions without the performance penalty. In this context, I will discuss compiler design and implementation techniques with examples.
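To make the definition concrete, the sketch below is a toy compiler: the source language is arithmetic expression trees, the target language is instructions for a small stack machine, and the optimizing pass is constant folding. All names (`fold`, `compile_expr`, `run`, the `PUSH`/`LOAD`/`APPLY` instructions) are invented for this example and are not drawn from the primer.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def fold(expr):
    """Optimizing pass: evaluate subexpressions whose operands are all constants."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)
    return (op, left, right)

def compile_expr(expr):
    """Translate an expression tree into instructions for a stack machine."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return compile_expr(left) + compile_expr(right) + [("APPLY", op)]
    if isinstance(expr, str):
        return [("LOAD", expr)]   # variable, looked up at run time
    return [("PUSH", expr)]       # numeric constant

def run(program, env):
    """Execute the target language on a stack, given variable bindings."""
    stack = []
    for instr, arg in program:
        if instr == "PUSH":
            stack.append(arg)
        elif instr == "LOAD":
            stack.append(env[arg])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()

source = ("+", ("*", 2, 3), ("*", "x", 4))   # (2 * 3) + (x * 4)
naive = compile_expr(source)
optimized = compile_expr(fold(source))
print(len(naive), len(optimized), run(optimized, {"x": 5}))
```

Both programs compute the same value, but the optimized one is shorter because `2 * 3` was evaluated at compile time rather than at run time.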
On Nov 12, the Broad welcomed a visit from Ryan Adams, a leader in machine learning - a field at the intersection of applied math and computer science that develops models and algorithms to learn from data...
Spring 2016 Schedule: 8:30am Primer, 9:30am Seminar, 10:30am Discussion in Monadnock
|Jan 27||Brendan Frey||Toronto Eng / Med / CS, CEO Deep Genomics||Genomic Medicine: Will Software Eat Bio?|
|Feb 3||Shamil Sunyaev||HMS, Brigham & Women’s, Broad||Judging the importance of human mutations using evolutionary models|
|Feb 10||David Benjamin [slides]||DSDE||Principal component analysis (PCA)|
|Jeremy Gunawardena [slides]||Harvard Systems Biology||Systems biology: can mathematics lead experiments?|
|Feb 17||Jon Bloom||Hail Team, Neale Lab||Frequentist vs Bayesian inference|
|Caroline Uhler [video, slides]||MIT IDSS / EECS||Gene Regulation in Space and Time|
|Feb 24||Giulio Genovese||McCarroll Lab||Hidden Markov models I [notes]|
|Geoffrey Schiebinger||Berkeley Stats||Sparse Inverse Problems [paper, paper]|
|Mar 2||Giulio Genovese||McCarroll Lab||Hidden Markov models II [notes]|
|Leonid Mirny [slides, v1, v2]||MIT Physics, HST||Polymer models of chromosomes|
|Mar 9||Hilary Finucane||MIT Math, Price Lab||Linear models I: ordinary least squares|
|Po-Ru Loh [slides]||HSPH, Price Lab||Haplotype phasing in large cohorts: Modeling, search, or both? [paper, code]|
|Mar 16||Yakir Reshef||HMS, HIPS||Linear models II: regularization and ridge [ipynb]|
|Jeremy Freeman [video]||Janelia Research Campus, HHMI||Open source tools for large-scale neuroscience [paper]|
|Mar 23||Alex Bloemendal||Hail Team, Neale Lab||Linear models III: regularization, LASSO and sparsity|
Google Research Cambridge
|A quick introduction to TensorFlow and related API's [cell_paper]|
|Mar 30||David Rolnick||MIT Math||Non-negative matrix factorization (NMF) [paper]|
|John Wakeley||Harvard OEB (Chair)||The effects of population pedigrees on gene genealogies [coalescent background]|
|Apr 6||David Kelley||Rinn Lab||Convolutional neural nets [background, paper]|
|Abraham Heifets||CEO Atomwise||AtomNet: a deep convolutional neural net for bioactivity prediction in structure-based drug discovery [paper]|
|Apr 13||Joseph Nasser [slides]||Connectivity Map||t-dist. stochastic neighbor embedding (t-SNE) [paper, homepage]|
Broad Imaging Platform
|Information in Cell Images: Targeting Diseases and Characterizing Compounds [paper]|
|Apr 20||Manuel Rivas [slides]||Daly Lab||Multiple testing and false discovery rate [paper]|
|Joshua Weinstein||Zhang Lab||DNA microscopy and the sequence-to-image inverse problem|
|Apr 27||Ryan Peckner [video]||GPP||Linear codes [book]|
|Yaniv Erlich [video, slides]||NY Genome Center||Compressed experiments [paper]|
|David Tse [video, slides]||1pm - 2pm, Yellowstone||The Science of Information: Case Studies from DNA and RNA Assembly [paper, paper]|
|May 4||Yakir Reshef [video]||HMS, MIT, HIPS||Gaussian processes [ipynb, kernel cookbook]|
|Barbara Engelhardt [video]||Princeton CS||Bayesian structured sparsity: rethinking sparse regression [paper]|
|May 11||David Benjamin [video, slides]||DSDE||Variational Bayesian inference|
|David Blei [video, intro slides]||Columbia Data Science, Columbia CS / Stats||Scaling and Generalizing Variational Inference [paper, edward]|
|May 18||Tim Poterba, Jon Bloom [video, outro slides]||Hail Team, Neale Lab||Basic introduction to distributed computation [mapReduce, deepNets, scalingBayes]|
|Matei Zaharia [video, slides]||MIT CSAIL, EECS; Co-founder, CTO Databricks||Scaling data analysis with Apache Spark [sparkRDD, MOOC]|
|May 25||No primer|
|MIA breakfast social|
|Jun 1||No primer|
CS, Genomics, EE
|Identifying molecular markers for cancer treatment from big data [paper]|
Genomic Medicine: Will Software Eat Bio?
Abstract: Deep learning will transform biology and medicine, but not in the way that many advocates think. Downloading ten thousand genomes and training a neural network to predict disease won't cut it. It is overly simplistic to believe that deep learning, or machine learning in general, can successfully be applied to genome data without taking into account biological processes that connect genotype to phenotype. The amount of data multiplied by the mutation frequency divided by the biological complexity and the number of hidden variables is too small. I’ll describe a rational “software meets bio” approach that has recently emerged in the research community and that is being pursued by dozens of young investigators. The approach has improved our ability to “read the genome”, and I believe it will have a significant impact on genome biology and medicine. I'll discuss which applications are ripe and which are merely seductive, how we should train models to take advantage of new types of data, and how we can interpret machine learning models.
Professor, Harvard Medical School
Research Geneticist, Brigham & Women’s Hospital
Associate Member, Broad Institute
Judging the importance of human mutations using evolutionary models
Abstract: Many forces influence the fate of alleles in populations, and the detailed quantitative description of the allelic dynamics is complex. However, some applications allow for simplifications making the evolutionary models useful in the context of human genetics. The examples include comparative genomics and the analysis of large scale sequencing datasets.
Professor, Harvard Medical School
Department of Systems Biology
Systems biology: can mathematics lead experiments?
Abstract: The -omic revolution in biology, and parallel developments in microscopy and imaging, have opened up fascinating new opportunities for analysing biological data using tools from the mathematical sciences. However, the kind of data we have and the way we interpret them are determined by the conceptual landscape through which experimentalists reason about biology. In this talk, I will consider how mathematics can help to shape that conceptual landscape and thereby suggest new experimental strategies. I will describe some of our recent work on how eukaryotic genes are regulated, which tries to update conventional thinking in this field, which is largely derived from bacterial studies, and I will point out how this exercise gives rise to mathematical conjectures for which we currently have no solutions.
MIT EECS, IDSS
Gene Regulation in Space and Time
Abstract: Although the genetic information in each cell within an organism is identical, gene expression varies widely between different cell types. The quest to understand this phenomenon has led to many interesting mathematics problems. First, I will present a new method for learning gene regulatory networks. It overcomes the limitations of existing algorithms for learning directed graphs and is based on algebraic, geometric and combinatorial arguments. Second, I will analyze the hypothesis that the differential gene expression is related to the spatial organization of chromosomes. I will describe a bi-level optimization formulation to find minimal overlap configurations of ellipsoids and model chromosome arrangements. Analyzing the resulting ellipsoid configurations has important implications for the reprogramming of cells during development.
UC Berkeley, Department of Statistics
Sparse Inverse Problems
Abstract: What can we learn by observing nature? How can we understand and predict natural phenomena? This talk is on the mathematics of precision measurement. How can we solve for the input that generated the output of some measurement apparatus? Our starting point is an information theoretic prior of sparsity. We investigate sparse inverse problems where we assume the input can be described by a small number of parameters. We introduce some of our recent theoretical results in superresolution and in spectral clustering. In particular, we show how to solve infinite dimensional deconvolution problems with finite dimensional convex optimization. And we show why dimensionality reduction can be such a useful preprocessing step for mixture models.
MIT Physics, Health Sciences and Technology
Polymer models of chromosomes
Abstract: DNA of the human genome is 2m long and is folded into a structure that fits in a cell nucleus. One of the central physical questions here is the question of scales: How can microscopic processes of molecular interactions of nanometer scale drive chromosomal organization at microns? Inferring principles of 3D organization of chromosomes from a range of biological data is a challenging biophysical problem. We develop a top-down approach to biophysical modeling of chromosomes. Starting with a minimal set of biologically motivated interactions we build polymer models of chromosome organization that can reproduce major features observed in Hi-C and microscopy experiments. I will present our work on modeling organization of human metaphase and interphase chromosomes.
Harvard School of Public Health, Price Lab
Haplotype phasing in large cohorts: Modeling, search, or both?
Abstract: Inferring haploid phase from diploid genotype data -- "phasing" for short -- is a fundamental question in human genetics and a key step in genotype imputation. How should one go about phasing a large cohort? The answer depends on how large. In this talk, I will contrast two approaches to computational phasing: hidden Markov models (HMMs), which perform precise but computationally expensive statistical inference, and long-range phasing (LRP), which relies instead on rapidly searching for long genomic segments shared among samples. I will present a new LRP method (Eagle), describe its performance on N=150,000 UK Biobank samples, and discuss future directions.
Janelia Research Campus, HHMI
Open source tools for large-scale neuroscience
Abstract: Modern computing and the web are both enabling and changing how we do science. Using neuroscience as an example, I will highlight some of these developments, spanning a surprising diversity of technologies. I'll discuss distributed computing for data analytics, cloud computing and containerization for reproducibility, peer-to-peer networks for sharing data and knowledge, functional reactive programming for hardware control, and webgl for large-scale interactive experiments. And I will describe several open source projects we and others are working on across these domains. I hope to convey both what we're learning about the brain with these approaches, and how science itself is evolving in the process.
Google Brain, Google Research Cambridge
A quick introduction to TensorFlow and related API's
Abstract: TensorFlow was recently released to the open source world as a platform for developing cutting-edge ML models, with an emphasis on deep architectures including neural nets, convolutional neural nets, recurrent neural nets, and LSTM's. The open source version of TensorFlow now supports distributed computation across many machines, opening up a new level of scale to the research community. In this talk, we'll go over a quick introduction to the basic TensorFlow abstractions, and will also look at some higher-level API's that offer a convenient level of abstraction for many common use cases. Folks interested in learning more are encouraged to visit tensorflow.org, and the excellent Udacity course on ML featuring TensorFlow.
Harvard Organismic and Evolutionary Biology (Chair)
The effects of population pedigrees on gene genealogies
Abstract: The models of coalescent theory for diploid organisms are wrongly based on averaging over reproductive, or family, relationships. In fact, the entire set of relationships, which may be called the population pedigree, is fixed by past events. Because of this, the standard equations of population genetics for probabilities of common ancestry are incorrect. However, the predictions of coalescent models appear surprisingly accurate for many purposes. A number of different scenarios will be investigated using simulations to illustrate the effects of pedigrees on gene genealogies both within and among loci. These scenarios include selective sweeps, the occurrence of very large families, and population subdivision with migration.
AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Abstract: Deep convolutional neural networks (neural nets with a constrained architecture that leverages the spatial and temporal structure of the domain they model) achieve the best predictive performance in areas such as speech and image recognition. Such neural networks autonomously discover and hierarchically compose simple local features into complex models. We demonstrate that biochemical interactions, being similarly local, are amenable to automatic discovery and modeling by similarly-constrained machine learning architectures. We describe the training of AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications, on millions of training examples derived from ChEMBL and the PDB. We visualize the automatically-derived convolutional filters and demonstrate that the system is discovering chemically sensible interactions. Finally, we demonstrate the utility of autonomously-discovered filters by outperforming previous docking approaches and achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark. In further contrast to existing DNN techniques, we show that AtomNet’s application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators.
Shantanu Singh, Anne Carpenter, Mohammad Rohban
Carpenter Lab, Broad Imaging Platform
Information in Cell Images: Targeting Diseases and Characterizing Compounds
Abstract: Our lab, the Broad’s Imaging Platform, aims to make perturbations in cell morphology as computable as other large-scale functional genomics data. We began by creating model-based segmentation algorithms to identify regions of interest in images (usually, individual cells or compartments within them) and produced software that has become the world standard for image analysis from high-throughput microscopy experiments (CellProfiler, cited in 3000+ scientific papers). We have taken on a new challenge – using cell images to identify signatures of genes and chemicals, with the ultimate goal of finding the cause and potential cures of diseases. High-throughput microscopy enables imaging several thousand cells per chemical or genetic perturbation, and identifying multiple organelles using fluorescent markers yields hundreds of image features per cell. We use this rich information to construct perturbation signatures or “profiles”. Our goals in these profiling experiments include identifying drug targets and mechanisms of action, determining the functional impact of disease-related alleles, creating performance-diverse chemical libraries, categorizing mechanisms of drug toxicity, and uncovering diagnostic markers for psychiatric disease. The technical challenges we encounter include dealing with cellular subpopulation heterogeneity, interpreting and visualizing statistical models, learning better representations of the data, and integrating imaging information with other data modalities.
DNA microscopy and the sequence-to-image inverse problem
Abstract: Technologies that jointly resolve both gene sequences and the spatial relationships of the cells that express them are playing an increasing role in deepening our understanding of tissue biology. In this talk, I will describe an experimental technique, called DNA microscopy, which encodes the physical structure and genetic composition of a biological sample directly into a library of DNA sequences. I will then discuss and demonstrate the application of N-body optimization to the inverse problem of inferring positions from real data.
Columbia University, New York Genome Center
Abstract: Molecular biology increasingly relies on large screens where enormous numbers of specimens are systematically assayed in the search for a particular, rare outcome. These screens include the systematic testing of small molecules for potential drugs and testing the association between genetic variation and a phenotype of interest. While these screens are "hypothesis-free," they can be wasteful; pooling the specimens and then testing the pools is more efficient. We articulate in precise mathematical ways the type of structures useful in combinatorial pooling designs so as to eliminate waste and to provide lightweight, flexible, and modular designs. We show that Reed-Solomon codes, and more generally linear codes, satisfy all of these mathematical properties. We further demonstrate the power of this technique with Reed-Solomon-based biological experiments. We provide general purpose tools for experimentalists to construct and carry out practical pooling designs with rigorous guarantees for large screens.
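One classical way to turn a Reed-Solomon code into a pooling design (a Kautz-Singleton-style construction, assumed here for illustration and not necessarily the design used in the talk) is to give each specimen a polynomial over a finite field and to place it in pool `(x, y)` whenever its polynomial evaluates to `y` at point `x`. The parameters below (`q=7`, `k=2`, three evaluation points) and the positive specimen IDs are hypothetical.

```python
from itertools import product

def pool_memberships(q, k, n_points):
    """Map each of the q**k specimens to its pools (q prime, n_points <= q)."""
    design = {}
    for specimen, coeffs in enumerate(product(range(q), repeat=k)):
        # The codeword symbol at x is the polynomial value sum(c_j * x**j) mod q;
        # the specimen joins pool (x, value) for every evaluation point x.
        design[specimen] = [
            (x, sum(c * pow(x, j, q) for j, c in enumerate(coeffs)) % q)
            for x in range(n_points)
        ]
    return design

def decode(design, positive_pools):
    """Call a specimen positive when every pool containing it tested positive."""
    return {s for s, pools in design.items() if all(p in positive_pools for p in pools)}

# 49 specimens screened with 3 * 7 = 21 pools instead of 49 individual tests.
design = pool_memberships(q=7, k=2, n_points=3)
true_positives = {5, 30}   # hypothetical rare hits
positive_pools = {p for s in true_positives for p in design[s]}
print(sorted(decode(design, positive_pools)))
```

Two distinct polynomials of degree below `k` agree at no more than `k - 1` points, so with `k=2` any two specimens share at most one pool. A negative specimen therefore cannot have all three of its pools covered by only two positives, which is why decoding is exact in this example.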
Stanford University and U.C. Berkeley
The Science of Information: Case Studies in DNA and RNA Assembly
Abstract: Claude Shannon invented information theory in 1948 to study the fundamental limits of communication. The theory not only establishes the baseline against which to judge all communication schemes but also inspires the design of schemes that are simultaneously information optimal and computationally efficient. In this talk, we discuss how this point of view can be applied to the problems of de novo DNA and RNA assembly from shotgun sequencing data. We establish information limits for these problems, and show how efficient assembly algorithms can be designed to attain these information limits, despite the fact that combinatorial optimization formulations of these problems are NP-hard. We discuss Shannon, a de novo RNA-seq assembly software designed based on such principles, and compare its performance against state-of-the-art assemblers on several datasets.
Bayesian structured sparsity: rethinking sparse regression
Abstract: Sparse regression has become an indispensable method for data analysis in the last 20 years. The general framework for sparse regression has a number of drawbacks that we and others address in recent methods, including robustness of model selection, issues with correlated predictors, and a test statistic that is based on the size of the effect. All of these issues arise in the context of association mapping of genetic variants to quantitative traits. This talk will discuss one approach to structured sparse regression to mitigate these problems in the context of genome-wide association mapping with quantitative traits using a Gaussian process prior to add structure to the sparsity-inducing prior across predictors. We will also describe ongoing efforts for variants on this model for different analytic purposes, including neuroscience applications, identifying driver somatic mutations in cancer, and methods for causal inference in observational data with large numbers of instruments.
Scaling and Generalizing Variational Inference
Abstract: Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields.
The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory tasks and predictive tasks. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis.
Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand.
In this talk I will discuss our recent research on addressing these two limitations. I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets. I will demonstrate its application in genetics to the STRUCTURE model of Pritchard et al., 2000. Then I will discuss black box variational inference. Black box inference is a generic algorithm for approximating the posterior. We can easily apply it to many models with little model-specific derivation and few restrictions on their properties. I will demonstrate how we can use black box inference to develop new software tools for probabilistic modeling.
Scaling data analysis with Apache Spark
Abstract: [we neglected to request an abstract but are confident the speaker knows something about Spark]
University of Washington, CS, EE and Genome Sciences
Identifying molecular markers for cancer treatment from big data
Abstract: The repertoire of drugs for patients with cancer is rapidly expanding; however, cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to specific drugs are in high demand. For example, patients over 65 with acute myeloid leukemia (AML), an aggressive blood cancer, have no better prognosis today than they did in 1980. For a growing number of diseases, there is a fair amount of data on molecular profiles from patients. The most important step necessary to realize the ultimate goal is to identify molecular markers in these data that predict treatment outcomes, such as response to each chemotherapy drug. However, due to the high dimensionality (i.e., the number of variables is much greater than the number of samples) along with potential biological or experimental confounders, it is an open challenge to identify robust biomarkers that are replicated across different studies. In this talk, I will present two novel machine learning algorithms to resolve these challenges. These methods learn the low-dimensional features that are likely to represent important molecular events in the disease process in an unsupervised fashion, based on molecular profiles from multiple populations of cancer patients. These algorithms led to the identification of novel molecular markers in AML and ovarian cancer.