Can we all agree that this is 2016 and next-generation sequencing is really just sequencing at this point?
Seriously, I was in college when NGS was becoming a thing. In techno-geological terms, that was the Cretaceous. Yet over a decade later, this super-vague term is somehow still stuck in our collective consciousness.
I'm not the one to say what the real next generation of sequencing is -- maybe it's Nanopore and all that exotic long-read tech. My point is that calling the current generation of sequencing technology next-gen or NGS is embarrassingly retrograde, and we should all stop*.
*I'm sure we have some old articles in our docs that use the term NGS, if you point them out to me I'll fix them.
Of course, there's still Sanger sequencing and we want to be able to tell the difference -- but really, isn't it Sanger that is the oddball now, and the rest is just regular sequencing? Well, if we must -- hey look, we have a technically accurate term: it's called high-throughput sequencing. It even comes with a reasonably snappy three-letter abbreviation to slap in titles and on posters where space is at a premium: HTS (putting the hts in htsjdk).
Alright, rant over. Until next time.
The slides are available here:
The company that hosts our forums will be performing some maintenance and upgrades on the servers tomorrow, Thursday May 14th. Based on the notification they sent us (included below), we should expect the forums to be unavailable for about 45 minutes at some point between 2pm and 8pm EST.
During that time, most of the cached GATK documentation will remain available through the Guide section of the GATK website. However, some pages may show error messages and some blog content may not be available, and of course during that time it will not be possible to ask or answer questions on the forum.
Thanks for your patience, and let's all look forward to improved reliability of our forum service.
Today, several members of our extended group are speaking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements related to GATK. Among other communications, there's a press release accompanied by a blog post on the Broad Institute blog, which together unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.
These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.
Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.
Alright, so what's happening exactly? Read on to find out!
In my last blog post, I introduced the Cromwell+WDL pipelining solution that we developed to make it easier to write and run sophisticated analysis pipelines on cloud infrastructure. I've also mentioned in the recent past that we're building the next generation of GATK (which will be GATK 4) to run efficiently on cloud-based analysis platforms.
So in this follow-up I want to explain why we care so much about building software that runs well in the cloud, which comes down to some key benefits of "The Cloud" over more traditional computing solutions like local servers and clusters (which I'll just refer to as clusters from now on -- the distinction is not really important here).
Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.
As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.
So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing Scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.
Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.
It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
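To give a taste of that human-readability, here's a minimal sketch of what a WDL script looks like -- the task name, command, and variable names are purely illustrative, written in the early draft-style syntax:

```wdl
task hello {
  String name

  command {
    echo 'Hello, ${name}!'
  }
  output {
    String response = read_string(stdout())
  }
}

workflow helloWorld {
  call hello
}
```

A task bundles a command line with its declared inputs and outputs, and a workflow wires calls to tasks together; Cromwell takes care of actually executing it, locally or on the cloud.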
Picard metrics that say 'percent' actually mean 'fraction'. Let's take metrics from MarkDuplicates as an example. Under PERCENT_DUPLICATION we see 0.134008. If we divide READ_PAIR_DUPLICATES by READ_PAIRS_EXAMINED we get roughly 1/7, or ~14%. This sanity check makes clear that the PERCENT_DUPLICATION metric is a fraction, which translates to 13.4%.
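To make the conversion concrete, here's a quick sketch with made-up counts (chosen so the fraction matches the 0.134008 example; real values come from your own metrics file):

```python
# Hypothetical MarkDuplicates-style counts -- not from a real metrics file.
read_pair_duplicates = 134_008
read_pairs_examined = 1_000_000

# The value reported under a 'PERCENT_*' heading is actually this fraction...
percent_duplication = read_pair_duplicates / read_pairs_examined
print(percent_duplication)           # 0.134008

# ...so multiply by 100 (or use % formatting) to get an actual percentage.
print(f"{percent_duplication:.1%}")  # 13.4%
```

The same reading applies to other Picard metrics whose names start with PERCENT_: treat the value as a fraction between 0 and 1.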
When you're setting up a variant discovery pipeline, you face two problems: deciding what tools to run (with what options), and how to run them efficiently so that it doesn't take forever. Between our documentation and our support forum, we can get you most if not all the way to solving the first problem, unless you're working with something really unusual.
However, the second problem is not something we've been able to help much with. We only benchmark computational requirements/performance for the purposes of our in-house pipelines, which are very specific to our particular infrastructure, and we don't have the resources to test different configurations. As a result it's been hard for us to give satisfying answers to questions like "How long should this take?" or "How much RAM do I need?" -- and we're aware this is a big pain point.
So I'm really pleased to announce that a team of engineers at Intel has been developing a system to profile pipelines that implement our Best Practices workflows on a range of hardware configurations. This is a project we've been supporting by providing test data and implementation advice, and it's really gratifying to see it bear fruit: the team recently published their first round of profiling, done on the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF) on a trio of whole genomes.
The white paper is available from the GATK-specific page of Intel's Health-IT initiative website and contains some very useful insights regarding key bottlenecks in the pipeline. It also details the applicability of parallelizing options for each tool, as well as the effect of using different numbers of threads on performance, when run on a single 36-core machine. Spoiler alert: more isn't always better!
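One well-known way to reason about why adding threads hits diminishing returns is Amdahl's law. Here's a toy sketch -- the 10% serial fraction is an arbitrary assumption for illustration, not a number from the white paper:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the fraction
# of the work that can only run serially. It shows why throwing more threads
# at a tool isn't always better.
def speedup(n_threads, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# Hypothetical serial fraction of 10% -- not a measured GATK value.
for n in (1, 2, 4, 8, 16, 36):
    s = speedup(n, 0.10)
    print(f"{n:2d} threads: {s:4.2f}x speedup, {s / n:5.1%} efficiency")
```

With even a modest serial fraction, per-thread efficiency falls off quickly (at 36 threads the hypothetical speedup caps out at 8x), which is consistent with the white paper's observation that each tool has a sweet spot.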
Read on for a couple of highlights of what I thought was especially interesting in the Intel team's white paper.
The presentation slide decks and hands-on tutorial materials can be downloaded at this Google Drive link.
Greetings fellow citizens of the Galactic Commonwealth. We the GATK Support Team welcome you once again to our understated, yet exceptionally informative blog.
As you may or may not know, the GATK and Picard toolkits are on converging arcs, and will eventually be merged into the unified, super-beefed-up toolkit that will be GATK 4. In preparation for that bright future, we have been fielding Picard usage-related questions in the GATK forum, and we have now officially taken over stewardship of the Picard documentation.
As it is so often said, with great power comes great responsibility. Our mission is to make the user experience as painless as humanly possible. “How does the support team carry out this challenging yet critically important task?” you ask.
Well, we've just completed our first big push on two fronts. In terms of content, we added a lot of new information to the Picard tool documentation, to provide more detail on what the tools do and how to use them optimally. Most recently, we've also been making some substantial changes to the organization of the content, as well as tweaking the general look and feel of the website.