Data Sciences Platform

The Broad Data Sciences Platform (DSP) is a methods development and software engineering group dedicated to maximizing the impact of the data sciences on the life sciences. DSP engineers, analysts, and designers build applications and capabilities to serve the Broad and beyond.

The DSP is organized around four principal components:

  • Workbench: A suite of web services that provides foundational computational capabilities for genomic and biomedical science, including infrastructure for storing, sharing, and analyzing genomic and clinical data at unlimited scale. Workbench supports FireCloud (a cloud-based cancer genomics analysis platform), as well as the All of Us Research Program.

  • Analytical tools: Applications and approaches, such as GATK, that incorporate best-practice pipelines for extracting all available information from read-level data and are offered via portals and other software-as-a-service mechanisms.

  • User interfaces: Web-based portals and other ways to make data and analytical methods available that engage researchers, clinicians, and patients. In particular, we develop software to support a number of direct-to-patient studies.

  • Production data processing: Tools and applications designed and scaled to process massive volumes of raw genome sequence data into forms scientists can use to create new knowledge. As part of this effort, we partner with the Broad Genomics Platform to process all the data it produces and reduce that data to a form researchers can use.

Flagship DSP software products

The DSP develops software products that are widely used across the biomedical ecosystem, such as:

  • GATK: The leading variant discovery package for analysis of high-throughput sequencing data.

  • Picard: A popular set of command-line tools for processing high-throughput sequencing data.

  • WDL (pronounced widdle): A user-friendly workflow description language designed from the ground up as a human-readable and -writable way of expressing tasks and workflows.

  • Cromwell: An execution engine that allows users to run WDL or Common Workflow Language (CWL) scripts on local machines, computer clusters, and cloud platforms (e.g., AWS, Google Cloud Platform).

  • The Data Donation Platform (DDP): A software stack that enables direct participant engagement, including consent and recontact, via intuitive web and mobile interfaces. DDP provides the underlying infrastructure for disease-specific registries such as the Angiosarcoma Project, the Rare Genomes Project, and the Global A-T Family Data Platform.

  • The Data Use Oversight System (DUOS): A suite of interfaces for managing interactions between data access committees and researchers seeking to access sensitive genomic datasets.

Flagship scientific projects and portals

The DSP plays pivotal roles in several national and international scientific initiatives, including:

  • All of Us Research Program: A National Institutes of Health (NIH)-funded initiative that will recruit 1 million or more U.S. citizens and collect their genomic and clinical data. Broad, in collaboration with Vanderbilt and Verily, is building a Workbench-based platform to store, share, and analyze all data generated as part of the program.

  • The Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD) browsers: The DSP helps members of the Broad Program in Medical and Population Genetics process and analyze these large collections of genome and exome data generated by collaborators around the world.

  • The Genotype-Tissue Expression (GTEx) Portal: A resource for genotype and tissue-specific gene expression correlation data.

  • The Human Cell Atlas (HCA): An international effort to comprehensively characterize cell types and cell states in health and disease. Broad, in collaboration with the European Bioinformatics Institute, the University of California at Santa Cruz (UCSC), and the Chan Zuckerberg Initiative, is building the HCA Data Coordination Platform, which will serve as the effort’s central collection, quality control, data processing, and data sharing point.

  • The Single Cell Sequencing Portal: A visualization and data exploration portal for single cell RNA sequencing data.

  • Cancer Genome Commons: A National Cancer Institute-funded initiative to provide a cloud-based ecosystem for storing, sharing, and analyzing key cancer datasets, including those of The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research To Generate Effective Treatments (TARGET) initiatives.

  • NIH Data Commons: An NIH-funded initiative to create a data commons for hosting key datasets, including Trans-Omics for Precision Medicine (TOPMed), GTEx, and model organism datasets. Broad, UCSC, and the University of Chicago are collaborating to create a software platform for storing, sharing, and analyzing data deposited in the commons.

Flagship DSP partnerships

To bring the tools of machine learning and cloud computing to bear on problems of fundamental importance to biomedicine, the DSP collaborates with world-leading technology corporations, philanthropic organizations, and pharmaceutical companies such as: