
2020 Workshops

Scalable Genomic Analysis using Hail

Hail is an open-source library that provides accessible interfaces for exploring genomic data, with a backend that automatically scales to take advantage of large compute clusters. Hail enables those without expertise in parallel computing to flexibly, efficiently, and interactively analyze large sequencing datasets. Hail is the analytical engine behind projects such as the Genome Aggregation Database, the UK Biobank mega-GWAS, eQTLs in GTEx, TOPMed, the Psychiatric Genomics Consortium, and the Centers for Mendelian Genomics. This workshop provides an introduction to Hail through hands-on exploration and analysis of public 1000 Genomes data. Following a brief conceptual overview, participants will be guided through a hands-on tutorial with interactive exercises. The workshop covers some of the most common use cases: general-purpose data exploration; variant and sample quality control; common variant association; and rare variant burden tests. By the end, participants will be ready to begin using Hail to answer their own scientific questions.

 
March 5

Introduction to Machine Learning on Biomedical Data

The course will begin with a very brief overview of the mathematical foundations of ML, specifically linear regression, logistic regression, and multilayer perceptrons. Applying these models to MNIST, the well-studied dataset of handwritten digits, we’ll gain first-hand experience with model selection, training, and validation. We will then introduce several abstractions from the ML4CVD codebase (TensorMaps, Model Factories, and Tensorization) which accelerate and simplify the process of preparing data for ML, building models, and evaluating them. Then we will examine several real-world biomedical datasets of various sizes, structures, and quality. Specifically, from the Allen Brain Atlas we will load high-dimensional brain MRI data linked with gene expression microarrays; from Qure.ai we will use the CQ500 dataset of 500 CT scans containing 193K slices linked with medical reports from 3 senior radiologists; and lastly, from the Erowid website we will use raw natural-language testimonials of drug experiences linked with basic demographics. These data will present many new challenges, such as noisy labels, small sample sizes, confounding by indication, and batch effects. These issues are less prevalent in typical ML datasets, like MNIST or ImageNet, but very common in biomedical data. After visually and statistically exploring the datasets, we will set up several ML problems using them as raw data. Lastly, we will explore model interpretation by visualizing saliency maps and class activation maps and by generating adversarial examples. The course will conclude with a general discussion on framing biological questions as machine learning problems and the opportunity for participants to brainstorm how ML might apply to their own datasets.

 
February 28

Introduction to PyMOL

PyMOL is molecular visualization software used to view proteins, DNA, and RNA in 3D. It allows users to view and analyze their system of interest and to create high-quality images and movies of their work.

This half-day workshop, designed for new and existing users, will be run by a team from Schrödinger, designers of the PyMOL software.

Who should attend
Computational biologists, software engineers, chemists, and other Broad Institute employees and affiliates interested in enhancing their ability to use the PyMOL software effectively. Space is limited to 30 attendees.

How to get PyMOL at the Broad?
It is recommended that you bring your laptop, with PyMOL preinstalled, and a three-button mouse to this workshop. The Broad has a site license for the commercial version of PyMOL for Windows and Mac; to install it, please bring your laptop to BITStop at 75A-6001. Alternatively, Mac users can install the PyMOL software themselves using the Self Service Tools by following the directions in this link.

 
February 11
 

From Missense Variants to Protein Sequence and Structure

With high-throughput next-generation sequencing platforms, it is now feasible to analyze the exomes and genomes of thousands, and soon millions, of people. Such growth of data comes with the daunting challenge of distinguishing and interpreting neutral and disease-associated missense (amino-acid-altering) variation, particularly for clinical variants observed in patients. Thus far, both functional and clinical interpretation of variation in disease-associated genes lags far behind data generation. In this workshop, we will describe how to use two Broad-hosted online resources, PER (Pathogenic variants Enriched Regions) and MISCAST (Missense variant to Protein Structure Analysis Suite), to analyze the impact of missense mutations on protein sequence and structure, and thereby interpret the variant. MISCAST is a platform with which a user can explore missense variants (single amino acid substitutions) mapped onto the linear protein sequence and the 3-dimensional protein structure. In addition, MISCAST provides amino-acid-based annotations of proteins' structural, physicochemical, and functional features, combined and aggregated from multiple open-source databases. With MISCAST, the user can explore the impact of an amino acid substitution in terms of how it perturbs protein features and functions. Through the PER browser, the user can investigate a missense mutation position in the context of essential regions of protein sequences that are constrained against variation in the general population and enriched with pathogenic variants observed in patients. The course will be divided into three parts: 1) a short lecture on the problem and background, 2) how to use MISCAST and the PER browser, and 3) hands-on tasks using the online tools.

 
January 24