Identification of small molecules targeting specific biological processes is an important problem in developing new probes or drugs. In the optimization phase, this process entails chemically modifying initial ‘hit’ or ‘lead’ molecules to improve performance in terms of potency, selectivity, or solubility. Because compound structure-activity relationships (SARs) are complex, researchers often select which portions of a compound structure to modify by intuition and trial and error. Given rich biological information from small-molecule profiling data, however, we can use multiple measurements of many compounds to develop predictive computational models relating performance to small-molecule structure. We develop a general, automated method of determining biologically relevant features of compounds based on data instead of intuition. In earlier work, we prototyped a method for detecting biologically relevant features for compounds in a specific library with limited structural diversity (Tanikawa et al., 2009). That method was not amenable to prediction because the chemical features it considered did not generalize to arbitrary compounds. Furthermore, it did not allow us to reason about the relative importance of substructural compound features. We now develop a method for predicting small-molecule performance that is compatible with more diverse compound libraries, using tools that automatically generate substructural features from a library of compounds. We design a technique based on regression trees (Breiman, 1984) and elastic net (Zou & Hastie, 2005) statistical methods to obtain an importance-weighted list of biologically relevant compound substructures in diverse libraries. We validate the method on a small-molecule microarray (SMM) assay (Duffner, Clemons, & Koehler, 2007) measuring the binding of ~6,600 compounds to 100 transcription factors.
This method has the potential to improve our quantitative understanding of the biological implications of compound structure and to guide future compound library syntheses.
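As a rough illustration of how an elastic net can yield an importance-weighted list of substructures, the sketch below fits an elastic-net regression by coordinate descent to synthetic data. It is not the project's implementation: the binary substructure fingerprints, the feature names (`substructure_A` and so on), the simulated binding scores, and the penalty settings are all assumptions made for the example, and the regression-tree component of the method is not shown.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net(X, y, alpha=0.05, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for the objective
    (1/2n)*||y - b0 - Xw||^2 + alpha*l1_ratio*||w||_1
                             + 0.5*alpha*(1 - l1_ratio)*||w||_2^2
    with an unpenalized intercept b0."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [y[i] - b0 for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Refit the intercept to the current residuals.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
    return b0, w

# Synthetic data (an assumption, not SMM measurements): 200 compounds,
# 8 hypothetical binary substructure features, of which only two
# actually influence the simulated binding score.
rng = random.Random(0)
names = ["substructure_" + c for c in "ABCDEFGH"]
true_w = [2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.0]
X = [[rng.randint(0, 1) for _ in names] for _ in range(200)]
y = [sum(x * t for x, t in zip(row, true_w)) + rng.gauss(0, 0.1) for row in X]

intercept, weights = elastic_net(X, y)

# Rank substructures by the magnitude of their fitted coefficient.
ranked = sorted(range(len(names)), key=lambda j: abs(weights[j]), reverse=True)
for j in ranked:
    print(f"{names[j]}: {weights[j]:+.3f}")
```

In this toy setting the two informative substructures should receive the largest coefficients, with signs matching their simulated effect, while the L1 part of the penalty shrinks the uninformative ones toward zero. With real profiling data, the penalty strength would normally be chosen by cross-validation rather than fixed by hand.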
PROJECT: Predictive modeling of small-molecule protein binding and binding promiscuity: Case study using 100 transcription factors studied by small-molecule microarrays
Mentor: Paul Clemons, Chemical Biology Program
"The Broad is a fantastic research institution involving researchers from many backgrounds with shared passion for science and knowledge. Being able to return to such an environment has been an incredibly inspiring and empowering experience and invaluable for my scientific career."