This talk was presented by Geraldine Van der Auwera at Genome Science UK at Oxford on September 2, 2014. Get the slide deck here; abstract below.
Variant discovery is greatly empowered by the ability to analyse large cohorts of samples rather than single samples taken in isolation, but doing so presents considerable challenges. Variant callers that operate per-locus (such as Samtools and GATK’s UnifiedGenotyper) can handle fairly large cohorts (thousands of samples) and produce good results for SNPs, but they perform poorly on indels. More recently developed callers that operate using assembly graphs (such as Platypus and GATK’s HaplotypeCaller) perform much better on indels, but their runtime and computational requirements tend to increase exponentially with cohort size, limiting their application to cohorts of hundreds at most. In addition, traditional multisample calling workflows suffer from the so-called “N+1 problem”, where full cohort analysis must be repeated each time new samples are added.
To overcome these challenges, we developed an innovative workflow that decouples the two steps in the multisample variant discovery process: identifying evidence of variation in each sample, and interpreting that evidence in light of the evidence gathered for the entire cohort. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and much faster) on one sample at a time. This decoupling hinges on the use of a novel method for reference confidence estimation that produces a genomic VCF (gVCF) intermediate for each sample.
The new workflow enables fast, highly accurate and computationally cheap variant discovery in cohort sizes that were previously intractable: it has already been applied successfully to a cohort of nearly one hundred thousand samples. This replaces previous brute-force approaches and lowers the threshold of accessibility of sophisticated cohort analysis methods for all, including researchers who do not have access to large amounts of computing power.
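To make the decoupling concrete, here is a minimal sketch of how the two steps might be scripted. It only builds command lines and does not run anything. The jar path, file names and the `plan_cohort` helper are hypothetical. The tool names and flags (`-T HaplotypeCaller`, `--emitRefConfidence GVCF`, `-T GenotypeGVCFs`, `--variant`) follow GATK 3.x conventions and should be checked against the documentation for the version you actually use.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Sequence

GATK_JAR = "GenomeAnalysisTK.jar"  # hypothetical location of a GATK 3.x jar


def per_sample_command(reference: str, bam: str, gvcf_dir: str) -> List[str]:
    """Step 1: gather evidence of variation for one sample, emitting a gVCF."""
    gvcf = PurePosixPath(gvcf_dir) / (PurePosixPath(bam).stem + ".g.vcf")
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", str(gvcf),
    ]


def joint_command(reference: str, gvcfs: Sequence[str], out_vcf: str) -> List[str]:
    """Step 2: interpret the per-sample evidence jointly across the cohort."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


def plan_cohort(
    reference: str,
    new_bams: Sequence[str],
    existing_gvcfs: Sequence[str],
    gvcf_dir: str,
    out_vcf: str,
) -> Dict[str, list]:
    """Plan an update: only new samples get step 1; step 2 covers everyone."""
    per_sample = [per_sample_command(reference, bam, gvcf_dir) for bam in new_bams]
    # The output gVCF path is the last element of each per-sample command.
    all_gvcfs = list(existing_gvcfs) + [cmd[-1] for cmd in per_sample]
    return {"per_sample": per_sample, "joint": joint_command(reference, all_gvcfs, out_vcf)}


if __name__ == "__main__":
    # Adding one new sample (s3) to a cohort that already has two gVCFs.
    plan = plan_cohort(
        "ref.fasta",
        ["bams/s3.bam"],
        ["gvcfs/s1.g.vcf", "gvcfs/s2.g.vcf"],
        "gvcfs",
        "cohort.vcf",
    )
    for cmd in plan["per_sample"] + [plan["joint"]]:
        print(" ".join(cmd))
```

The point of the sketch is the shape of the plan: when a sample is added, the expensive per-sample step runs once for that sample only, and just the joint step is repeated over the stored gVCFs. Each command list could be handed to `subprocess.run` or to a cluster scheduler.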
These are the materials that were presented at the November 2015 GATK workshop at the Broad Institute in Cambridge, MA.
|Material|Link|
|---|---|
|Slide decks presented on Day 1|Google Drive Folder|
|Workshop handout document (agenda and talk abstracts)|PDF on Google Drive|
|Variant Discovery Tutorial (Day 2 AM; same as the ASHG15 tutorial)|Google Drive Folder|
|Callset Filtering and Evaluation Tutorial (Day 2 PM)|Google Drive Folder|
The slides from the talk I gave at the Genome Science UK conference (Oxford, Sept 2) are now available here.
A few impressions below the fold.
As always it was fun meeting plenty of GATK users and other researchers in general, and very exciting to get to spread the word about our new workflow and the HaplotypeCaller's capabilities. Special thanks to Mick Watson from Edinburgh Genomics for inviting me, and his team for making me feel super welcome.
I really enjoyed getting to see quite a few microbial genomics talks, since I am originally a microbiologist by training. Too bad I had to miss the plant/forestry B session, as I'm very curious about the crazy-ploidy aspects of plant genomics, but that's the curse of parallel sessions I suppose. Very interesting conference overall for sure, and a nice group size -- lively but not overwhelming (I dislike mega-cons like ASHG). Though maybe next time Nick Loman should get the main lecture theater for his MinION talk instead of the little basement room -- and someone should make sure the wifi network can handle dozens of data-crazed scientists trying to download his MinION datasets at the same time.
A final note on the live-tweeting, i.e. people in the audience tweeting snippets during the talk. I was aware of this as a trend, and in fact I've followed tweet-streams of other people's talks, but had never experienced it myself as a speaker. It's a bit nerve-racking but very interesting as an insight into what (at least some of) the audience reacted most strongly to and took away from the talk. It also stimulated some interesting follow-up exchanges, so I'll tentatively classify it as a Good Thing for now.
Ladies, gentlemen and everyone else (this is a judgment-free zone), it's officially summertime in the northern hemisphere. Depending on who and where you are, this can mean no more classes, no more exams, and more quality time with your loved ones -- or extra expense getting someone to keep your offspring out of your way (hello summer camp). It is that hallowed time of year when academics put down the burdens of teaching and administrative duties and can finally get some science done. For many, it also means conference season, e.g. meeting up in Spain with a bunch of colleagues to argue about obscure methodological details over many a glass of tinto de verano. It's a hard, hard life.
A group of us just got back from sunny Belgium* where we held a GATK workshop at the invitation of the Royal Institute for Natural Sciences in Brussels. Now we're looking ahead to the next big dates on the horizon, and I thought I'd share them here in case some of you can join us. Or in case you would like to invite us over to give talks or workshops... (seriously, private-message me if you're interested in hosting a GATK workshop).
* This is not irony, it was really beautiful the whole time. Until the two non-Belgians left, and then boom! Downpour for three days. Typical.
As you can see below, our dance card is all clear for the summer itself, but starting in September it gets pretty busy.
I will be attending the Genome Science meeting in Oxford, UK, and giving a talk in the Bioinformatics Infrastructure session.
Mauricio Carneiro and David Roazen will be attending this C++ development conference, and Mauricio will be presenting a talk titled "Gamgee: A C++14 library for genomics data processing and analysis". What this means for GATK-based development, well... happy speculating :)
The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine are hosting our workshop. The workshop crew will consist of Eric Banks, Sheila Chandran and myself. See this announcement for more details.
Ami Levy-Moonshine will be attending the ASHG meeting in San Diego, California, and giving a compressed version of our Best Practices workshop on Tuesday 10/21 (separate registration required). Ami will also represent us in the iSeqTools workshop on cloud-based analysis on Monday 10/20.
Due to scheduling constraints, we were unable to make this workshop happen, but we have tentative dates for March 2015 (3/19-3/20).
This list will be updated with any new events up to December 2014.