Data Sciences

Broad Institute researchers generate on the order of 20 terabytes of sequence data every day (roughly equivalent to more than 6.6 billion tweets or 3,300 high-definition feature-length movies). This vast trove of information holds knowledge that could fundamentally transform our understanding of human biology, health, and disease, especially when combined with other sources of data, such as phenotypes, patient medical records, and even information from personal fitness devices.
Generating insights that will lead to breakthroughs requires that those data, and the tools we build to study them, be stored, curated, analyzed, updated, and shared rapidly, efficiently, openly, accurately, and broadly, all with privacy, security, and informed patient consent remaining top of mind.
The computer scientists, software engineers, informaticians, mathematicians, and others who make up the Broad Institute’s data science community share three core principles that form a foundation for addressing the growing computational needs of large-scale genomic and biomedical research. We believe in:
- The value of vast and diverse types of data. Biomedical research today requires platforms that make it easy to store, access, analyze, and process sequence data, medical records, and other complementary forms of information at very large scale, while protecting patient privacy and ensuring security.
- Development of open-source tools and resources. The Broad’s Data Sciences Platform has committed to making all of the software products it develops open source. (Learn more in our blog post, “Open source: Foundation for the future,” and our explainer, “Creating tools to generate data insights.”)
- Widespread sharing of ideas and data within the scientific and computational community. Since before the launch of the Human Genome Project, the Broad’s research community has been committed to making data and tools available to researchers worldwide.
Members of the data sciences community are woven tightly into the fabric of the Broad. They play prominent roles in the Institute’s programs, platforms, and initiatives. A few examples:
- Cancer Program: The Broad Cancer Program’s many data scientists form the backbone of several large teams, including the Cancer Data Science, Cancer Genome Computational Analysis, and Connectivity Map groups. These teams develop and maintain tools and resources for analyzing a wide variety of high-throughput screening results and cancer genome data, such as the Cancer Dependency Map portal and the Drug Repurposing Hub. Many of these tools are available on the Broad's Data, Software and Tools page.
- Data Sciences Platform (DSP): The DSP is a team of software engineers, computational biologists, and other technical contributors who are developing open-source software products for the analysis of genomic and clinical data at large scale, including Terra, GATK, Picard, FireCloud, WDL, and numerous direct-to-patient portals.
- Epigenomics Program: The Broad Epigenomics Program includes robust computational and software engineering efforts responsible for developing tools and generating data for understanding how the genome is regulated.
- Imaging Platform: The Broad Imaging Platform develops open-source software tools such as CellProfiler and CellProfiler Analyst for analyzing and mining image-based data, and helps biologists to apply them to important questions in biomedicine.
- LIMS and Analytics: The LIMS and Analytics group develops and maintains information and reporting systems that support the Broad Genomics Platform’s daily activities.
- Program in Medical and Population Genetics (MPG): Members of MPG have played key roles in developing a range of portals and computational tools, including the Genome Aggregation Database (gnomAD) variant browser and the Hail variant analysis and exploration framework.
In addition, Broad data scientists have created two unique activities that support collaboration and provide opportunities for ongoing professional development:
- Models, Inference, and Algorithms (MIA): MIA is an initiative that supports learning and collaboration at the interface of biology with mathematics, statistics, machine learning, and computer science.
- Software Engineering (SoftEng) Affinity Group: This internal group supports the professional growth of software engineers at Broad through an ongoing speaker series, career development opportunities, and occasions for community building.
