This page refers to a funding mechanism that is no longer in place.

Accessing MLPCN Data

Center-Driven Project

The Broad Institute’s MLPCN Center-Driven Project (CDP) used computational analysis of profiling data to identify high-value subsets of compounds within screening collections based on biological performance diversity. The two major use cases of these profiling analyses are:

  1. Identifying new leads for MLPCN projects from a novel collection of compounds synthesized at the Broad Institute.
  2. Providing a systematic and data-driven approach to creating performance-diverse screening collections in the future.

Using multiplex-cytological (MC) and gene-expression (GE) profiling assays, we assembled a reference dataset of profiles for three groups of compounds: structurally and stereochemically diverse compounds derived from diversity-oriented synthesis (DOS) (18,000-20,000); chemically diverse Molecular Library Initiative (MLI) compounds with biologically diverse performance, identified through analysis of PubChem (2,200-10,000); and known bioactive compounds to serve as landmarks (2,500). Careful analysis of pilot data determined how many replicates to collect with each profiling method, which in turn determined how many compounds could be included in the reference set (~30K total for MC; ~20K total for GE). We also assembled and profiled a collection of 274 distinct compounds nominated by MLPCN Centers from projects for which the Centers would like to identify new chemical series with similar activities. Comparing the profile of a lead of interest with the reference dataset, using our novel similarity-search approaches, is uncovering:

  1. New compounds that share a biological profile, and thus cellular activity, with an initial lead or bioactive compound.
  2. Groups of DOS compounds sharing a novel biological profile and enriched for specific structure elements (profile-based structure–activity relationships).
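The page does not say which similarity measure the project's search approaches use. As a generic illustration of comparing a lead's profile against a reference set, the sketch below ranks reference compounds by cosine similarity. This is a stand-in technique, not the project's method; the function names, compound identifiers, and profile values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero profiles
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, reference, top_n=5):
    """Return up to top_n (compound_id, score) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query, profile))
              for cid, profile in reference.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy profiles; real profiles have many more features.
reference = {
    "cpd_A": [1.0, 2.0, 0.0],
    "cpd_B": [-1.0, -2.0, 0.0],
    "cpd_C": [0.0, 0.0, 3.0],
}
query = [2.0, 4.0, 0.0]
hits = rank_by_similarity(query, reference, top_n=2)
```

Here `cpd_A` ranks first because its profile points in the same direction as the query, while `cpd_B`, whose profile is the exact opposite, falls outside the top two. In practice the choice of measure and any normalization of the profiles would need to follow the methods described in the associated publication.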


The complete profiling dataset can be accessed using the following link:

The dataset used in the publication “Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling” is available using the following link:

Profiling data are provided as both GCT and GCTX files, which differ only in encoding. GCT is a text-based format used by GenePattern. Tools to read and process HDF5-based GCTX files can be obtained via
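Because GCT is plain tab-delimited text, a small file can be read without special tooling. The following is a minimal sketch that assumes the GCT version 1.2 layout: a `#1.2` version line, a line giving the row and column counts, a header line beginning with `Name` and `Description`, then one data row per feature. The files distributed here may use a different GCT version with additional metadata rows and columns, in which case this parser rejects them rather than guessing. The function name `read_gct_v12` and the sample contents are hypothetical.

```python
import io


def read_gct_v12(handle):
    """Parse a GCT v1.2 text stream.

    Returns (row_ids, descriptions, col_ids, matrix), where matrix is a
    list of rows and each row is a list of floats.
    """
    version = handle.readline().strip()
    if version != "#1.2":
        # Assumption: only the v1.2 layout is handled by this sketch.
        raise ValueError(f"unsupported GCT version line: {version!r}")

    # Second line: number of data rows and number of data columns.
    n_rows, n_cols = (int(tok) for tok in handle.readline().split()[:2])

    # Third line: "Name", "Description", then one ID per data column.
    header = handle.readline().rstrip("\r\n").split("\t")
    col_ids = header[2:]

    row_ids, descriptions, matrix = [], [], []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        fields = line.split("\t")
        row_ids.append(fields[0])
        descriptions.append(fields[1])
        matrix.append([float(value) for value in fields[2:]])

    # Check the parsed shape against the declared dimensions.
    if (len(matrix) != n_rows or len(col_ids) != n_cols
            or any(len(row) != n_cols for row in matrix)):
        raise ValueError("matrix shape does not match declared dimensions")
    return row_ids, descriptions, col_ids, matrix


# Hypothetical two-feature, three-sample file held in memory.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\ts1\ts2\ts3\n"
    "f1\tna\t0.5\t1.0\t-2.0\n"
    "f2\tna\t3.0\t0.0\t1.5\n"
)
row_ids, descriptions, col_ids, matrix = read_gct_v12(io.StringIO(example))
```

To read a file on disk, pass an open text handle instead, for example `with open(path) as fh: read_gct_v12(fh)`. GCTX files are HDF5-based and need an HDF5-aware reader rather than a text parser like this one.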


Contact Us
If you have any questions, please email us and your request will be directed to the relevant contact.