Data Sciences Platform

The life sciences are in the midst of a data revolution. Cheap and accurate genome sequencing is a reality, advanced imaging is routine, and clinical data is increasingly stored in electronic formats. These innovations — and the massive data sets they produce — have brought us to the threshold of a new era in medicine, one where the data sciences hold the potential to propel our understanding and treatment of human disease.

The Broad Data Sciences Platform (DSP) is a methods development and software engineering group dedicated to maximizing the impact of the data sciences on the life sciences. DSP engineers, analysts, and designers build applications and capabilities to serve the Broad and beyond.

The DSP is organized around four principal components:

  • Workbench: A suite of web services that provides foundational computational capabilities for genomic and biomedical science, including infrastructure for storing, sharing, and analyzing genomic and clinical data at large scale. Workbench supports Terra, an open cloud-based platform for accessing data, performing analyses, and collaborating securely, which powers projects like FireCloud, AnVIL, and the Researcher Workbench used by the All of Us Research Program.

  • Analytical tools: Open-source applications and approaches such as GATK that provide best practice pipelines for extracting all available information from read-level data, available for download as well as via portals and other software-as-a-service mechanisms.

  • User interfaces: Web-based portals and other ways of making data and analytical methods available that engage researchers, clinicians, and patients. In particular, we develop software to support a number of direct-to-patient studies.

  • Production data processing: Tools and applications designed and scaled to process massive volumes of raw genome sequence data into forms scientists can use to create new knowledge. As part of this effort, we partner with the Broad Genomics Platform to process all of the data it produces into a form that researchers can use.

Flagship DSP software products and services

The DSP develops software products and operates services that are widely used across the biomedical ecosystem, such as:

  • Terra: An open cloud-based platform for accessing data, performing analyses, and collaborating securely, developed in collaboration with Microsoft and Verily Life Sciences.

  • GATK: The leading open-source variant discovery package for analysis of high-throughput sequencing data.

  • Picard: A popular set of open-source command-line tools for processing high-throughput sequencing data.

  • Cromwell: An execution engine that allows users to run reproducible workflows written in either the Workflow Description Language (WDL, pronounced "widdle") or the Common Workflow Language (CWL), portable across local machines, compute clusters, and cloud platforms (e.g., AWS, Microsoft Azure, Google Cloud Platform).

  • The Data Donation Platform (DDP): A software stack that enables direct participant engagement, including consent and recontact, via intuitive web and mobile interfaces. DDP provides the underlying infrastructure for disease-specific registries such as the Angiosarcoma Project, the Rare Genomes Project, and the Global A-T Family Data Platform.

  • The Data Use Oversight System (DUOS): A suite of interfaces for managing interactions between data access committees and researchers seeking to access sensitive genomic datasets.

Flagship scientific projects and portals

The DSP plays pivotal roles in several national and international scientific initiatives, including:

  • All of Us Research Program: A National Institutes of Health (NIH)-funded initiative that aims to recruit 1 million or more people living in the United States and collect their genomic and clinical data. Broad, in collaboration with Vanderbilt and Verily, is building a Workbench-based platform to store, share, and analyze all data generated as part of the program.

  • The Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD) browsers: The DSP helps members of the Broad Program in Medical and Population Genetics process and analyze these large collections of genome and exome data generated by collaborators around the world.

  • The Genotype-Tissue Expression (GTEx) Portal: A resource for genotype and tissue-specific gene expression correlation data.

  • The Human Cell Atlas (HCA): An international effort to comprehensively characterize cell types and cell states in health and disease. Broad, in collaboration with the European Bioinformatics Institute, the University of California, Santa Cruz (UCSC), and the Chan Zuckerberg Initiative, is building the HCA Data Coordination Platform, which will serve as the effort’s central point for data collection, quality control, processing, and sharing.

  • The Single Cell Portal: A visualization and data exploration portal for single cell RNA sequencing data.

  • Cancer Genome Commons: A National Cancer Institute-funded initiative to provide a cloud-based ecosystem for storing, sharing, and analyzing key cancer datasets, including those of The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research To Generate Effective Treatments (TARGET) initiatives.

  • NIH Data Commons: An NIH-funded initiative to create a data commons for hosting key datasets, including Trans-Omics for Precision Medicine (TOPMed), GTEx, and model organism datasets. Broad, UCSC, and the University of Chicago are collaborating to create a software platform for storing, sharing, and analyzing data deposited in the commons.

Flagship DSP partnerships

To bring the tools of machine learning and cloud computing to bear on problems of fundamental importance to biomedicine, the DSP collaborates with world-leading technology corporations, philanthropic organizations, and pharmaceutical companies such as: