Mauricio Carneiro presented this slide deck at the workshop organized by Mount Sinai School of Medicine on December 10, 2013. The other presentations made at the workshop were posted here.
Please note that we cannot guarantee content hosted on other websites; if an outgoing link becomes outdated, please let us know.
Here's my abstract for the upcoming Genome Science UK meeting in Oxford, where I'll be talking about our hot new workflow for variant discovery. The slide deck will be posted in the Presentations section as usual after the conference.
Variant discovery is greatly empowered by the ability to analyse large cohorts of samples rather than single samples taken in isolation, but doing so presents considerable challenges. Variant callers that operate per-locus (such as Samtools and GATK’s UnifiedGenotyper) can handle fairly large cohorts (thousands of samples) and produce good results for SNPs, but they perform poorly on indels. More recently developed callers that operate using assembly graphs (such as Platypus and GATK’s HaplotypeCaller) perform much better on indels, but their runtime and computational requirements tend to increase exponentially with cohort size, limiting their application to cohorts of hundreds at most. In addition, traditional multisample calling workflows suffer from the so-called “N+1 problem”, where full cohort analysis must be repeated each time new samples are added.
To overcome these challenges, we developed an innovative workflow that decouples the two steps in the multisample variant discovery process: identifying evidence of variation in each sample, and interpreting that evidence in light of the evidence gathered for the entire cohort. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and much faster) on one sample at a time. This decoupling hinges on the use of a novel method for reference confidence estimation that produces a genomic VCF (gVCF) intermediate for each sample.
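To make the decoupling concrete, here is a toy sketch in Python. It is not GATK code: the function names, the read-count "evidence" and the threshold rule are all hypothetical stand-ins, chosen only to show that the per-sample step can be run once per sample and stored, while only the cheap joint step is repeated when a new sample arrives.

```python
# Toy illustration of the two decoupled steps. Everything here is a
# hypothetical stand-in: real per-sample intermediates are gVCF files and
# real joint genotyping uses a statistical model, not a read-count threshold.

def summarize_sample(name, alt_read_counts):
    """Step 1 (one sample at a time): record the evidence seen at every locus.

    Loci with zero supporting reads are kept too, as a crude analogue of
    recording confidence that the sample matches the reference there.
    """
    return {"sample": name, "evidence": dict(alt_read_counts)}


def joint_genotype(summaries, min_total_alt_reads=3):
    """Step 2 (whole cohort): interpret the pooled evidence across samples."""
    totals = {}
    for summary in summaries:
        for locus, count in summary["evidence"].items():
            totals[locus] = totals.get(locus, 0) + count
    # Report a locus as variant when the cohort-wide evidence passes the
    # (made-up) threshold.
    return sorted(locus for locus, n in totals.items() if n >= min_total_alt_reads)


# Initial cohort: each sample is summarized once and the summary is kept.
cohort = {
    "s1": {"chr1:100": 2, "chr1:200": 0},
    "s2": {"chr1:100": 1, "chr1:200": 0},
}
summaries = [summarize_sample(name, counts) for name, counts in cohort.items()]
first_calls = joint_genotype(summaries)

# A new sample arrives: only that sample goes through step 1 again.
summaries.append(summarize_sample("s3", {"chr1:100": 0, "chr1:200": 3}))
updated_calls = joint_genotype(summaries)
```

In this toy version, adding the third sample reuses the two stored summaries untouched, which is the property that sidesteps the "N+1 problem" described above.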
The new workflow enables fast, highly accurate and computationally cheap variant discovery in cohort sizes that were previously intractable: it has already been applied successfully to a cohort of nearly one hundred thousand samples. This replaces previous brute-force approaches and lowers the threshold of accessibility of sophisticated cohort analysis methods for all, including researchers who do not have access to large amounts of computing power.
Ladies, gentlemen and everyone else (this is a judgment-free zone), it's officially summertime in the northern hemisphere. Depending on who and where you are, this can mean no more classes, no more exams, and more quality time with your loved ones -- or extra expense getting someone to keep your offspring out of your way (hello summer camp). It is that hallowed time of year when academics put down the burdens of teaching and administrative duties and can finally get some science done. For many, it also means conference season, e.g. meeting up in Spain with a bunch of colleagues to argue about obscure methodological details over many a glass of tinto de verano. It's a hard, hard life.
A group of us just got back from sunny Belgium* where we held a GATK workshop at the invitation of the Royal Institute for Natural Sciences in Brussels. Now we're looking ahead to the next big dates on the horizon, and I thought I'd share them here in case some of you can join us. Or in case you would like to invite us over to give talks or workshops... (seriously, private-message me if you're interested in hosting a GATK workshop).
* This is not irony, it was really beautiful the whole time. Until the two non-Belgians left, and then boom! Downpour for three days. Typical.
As you can see below, our dance card is all clear for the summer itself but starting September it gets pretty busy.
I will be attending the Genome Science meeting in Oxford, UK, and giving a talk in the Bioinformatics Infrastructure session.
Mauricio Carneiro and David Roazen will be attending this C++ development conference, and Mauricio will be presenting a talk titled "Gamgee: A C++14 library for genomics data processing and analysis". What this means for GATK-based development, well... happy speculating :)
The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine are hosting our workshop. The workshop crew will consist of Eric Banks, Sheila Chandran and myself. See this announcement for more details.
Ami Levy-Moonshine will be attending the ASHG meeting in San Diego, California, and giving a compressed version of our Best Practices workshop on Tuesday 10/21 (separate registration required). Ami will also represent us in the iSeqTools workshop on cloud-based analysis on Monday 10/20.
Details to be announced.
This list will be updated with any new events up to December 2014.
Alright, the next release is going to be version 3.0. So what's in it??
We'll have a full overview ready for you in the next few days.
In the meantime, if you're at AGBT-2014 working on your tan (lucky devil), one way to find out is to go see Mauricio Carneiro's poster during the Thursday afternoon poster session, and ask him all about it (you're welcome, MC -- we know you like the attention).
If you're not, here's a copy of Mauricio's poster, which features three of the top features in GATK 3.0. Because why should you miss out, on top of having to shovel snow all over again tomorrow (or whatever the applicable chore is in your neck of the woods) instead of drinking margaritas by the pool?
In case you're wondering, the owl is the mascot of the nightly builds. He/she (we're not sure; we respect its privacy) builds a fresh copy of the GATK every night with the day's new developments that made it into the master branch.