Broad Institute researchers generate on the order of 20 terabytes of sequence data every day, roughly equivalent to more than 6.6 billion tweets or 3,300 high-definition feature-length movies. This vast trove of information holds knowledge that could fundamentally transform our understanding of human biology, health, and disease, especially when combined with other sources of data, such as phenotypes, patient medical records, and even information from personal fitness devices.
Generating insights that lead to breakthroughs requires that those data, and the tools we build to study them, be stored, curated, analyzed, updated, and shared rapidly, efficiently, openly, accurately, and broadly, with privacy, security, and informed patient consent remaining top of mind throughout.
The computer scientists, software engineers, informaticians, mathematicians, and others who make up the Broad Institute’s data science community share three core principles that form a foundation for addressing the growing computational needs of large-scale genomic and biomedical research. We believe in:
- The value of vast and diverse types of data. Biomedical research today requires platforms that make it easy to store, access, analyze, and process sequence data, medical records, and other complementary forms of information at very large scale, while protecting patient privacy and ensuring security.
- Development of open-source tools and resources. The Broad’s Data Sciences Platform has committed to making all of the software products it develops open source. (Read more in our blog post, “Open source: Foundation for the future.”)
- Widespread sharing of ideas and data within the scientific and computational community. Since before the launch of the Human Genome Project, the Broad’s research community has been committed to making data and tools available to researchers worldwide.
Members of the data sciences community are woven tightly into the fabric of the Broad. They play prominent roles in the Institute’s programs, platforms, and initiatives. A few examples:
- Cancer Genome Analysis (CGA): The CGA group in the Broad Institute’s Cancer Program develops, builds, and maintains tools for analyzing cancer genome data.
- Data Sciences Platform (DSP): The DSP is a team of software engineers, computational biologists, and other technical contributors who are developing open-source software products for the analysis of genomic and clinical data at large scale, including GATK, Picard, FireCloud, and numerous direct-to-patient portals.
- Imaging Platform: The Broad Imaging Platform develops open-source software tools for analyzing and mining imaging-based data, and helps biologists to apply them to important questions in biomedicine.
- LIMS and Analytics: The LIMS and Analytics group develops and maintains information and reporting systems that support the Broad Genomics Platform’s daily activities.
- Program in Medical and Population Genetics (MPG): Members of MPG have played key roles in developing a range of portals and computational tools, including the ExAC and gnomAD variant browsers and the Hail data analysis and exploration framework.
In addition, Broad data scientists have created two unique activities that support collaboration and provide opportunities for ongoing professional development:
- Models, Inference, and Algorithms (MIA): MIA is an initiative that supports learning and collaboration at the interface of biology with mathematics, statistics, machine learning, and computer science.
- Software Engineering (SoftEng) Affinity Group: This internal group supports software engineers at the Broad and their professional growth through an ongoing speaker series, career development opportunities, and occasions for community building.