We've got so many exciting projects going on right now, between the GATK4 alpha, expanding our scope to include copy number and structural variation, and ramping up to offer GATK-as-a-service -- we're going to need more talent! And possibly a bigger boat.
So if you or someone you know (and like) are looking to join a team working on cutting-edge analysis methods and software, real big data and a mission that matters; if you like the idea of a stimulating professional environment with competitive compensation where equal opportunity and work-life balance are not empty phrases; and if your Memorial Day plans get canceled on account of rain -- why not take that time to polish up your resume and apply for one of the jobs below?
We look forward to hearing from you!
In my last blog post, I mentioned that GATK 3.6 would support Java 8. I also mentioned that we had some evidence from our Java migration testing that GATK 3.5 (and presumably older versions as well) may produce correctness errors if run on Java 8. Since then quite a few people have expressed concern because they have been running GATK 3.5 or older versions on Java 8 already. Most wanted to know what the nature and magnitude of these problems would be, and whether they should re-run the affected data on Java 7.
We don't have definitive answers for these questions because we haven't performed end-to-end testing of GATK 3.5 on Java 8. Once we noticed that some of the automated tests were failing when we switched Java versions, we hunted down the source of the test failures (some Java list structures for which iteration order is not the same between Java versions) and fixed them.
So for us, the story stops there. But we understand that those of you who have been misguidedly running GATK on Java 8 need more information to decide what to do. I thought that perhaps sharing the relevant details from our migration test results might help, so I compiled a summary of per-tool tests that were affected, with some developer notes and values that were discussed in the issue ticket.
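For context, here is a minimal sketch (not actual GATK code) of the class of bug involved: collections like HashSet leave iteration order unspecified, and the order actually observed has differed across JVM versions because the underlying hashing internals were reworked in Java 8. Any result that depends on that order can change silently when you switch Java versions. The fix is to use a collection with a defined order:

```java
import java.util.*;

public class IterationOrderDemo {
    // Returns the elements of a set in its iteration order.
    static List<String> iterationOrder(Set<String> set) {
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        List<String> contigs = Arrays.asList("chr2", "chr10", "chr1");

        // HashSet makes NO guarantee about iteration order, and the order
        // observed in practice has differed between Java 7 and Java 8.
        // Code that depends on this order can silently change its output.
        System.out.println("HashSet order (JVM-dependent): "
                + iterationOrder(new HashSet<>(contigs)));

        // The fix: use a collection with a defined order, e.g.
        // LinkedHashSet (insertion order) or TreeSet (sorted order).
        System.out.println("LinkedHashSet order (stable):  "
                + iterationOrder(new LinkedHashSet<>(contigs)));
        System.out.println("TreeSet order (stable):        "
                + iterationOrder(new TreeSet<>(contigs)));
    }
}
```

This is the same style of remedy our developers applied when they tracked down the failing automated tests.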
Since the GATK first started gaining traction in the research community ca. 2010, its development has sustained a fairly rapid pace, with a new major version (1, 2 and 3 so far) coming out about every 2 years. Each major version was a straight continuation of the same codebase, distinguished by significantly new tools and capabilities (e.g. HaplotypeCaller in 2.0, the GVCF workflow in 3.0).
This year, we're on track for a new major version, but we're breaking the mold of the classic GATK codebase. Over the past 18 months, in parallel to the ongoing 3.x development effort, we built a brand new GATK engine that is faster, more scalable and can support new types of analysis that weren't possible in the original GATK framework. Now we're hard at work porting the classic GATK tools over to the new framework, as well as developing some new ones (copy number!). The resulting toolkit will be formally released as GATK 4 later this year. If you're keen to try it out, it's already available as an alpha preview; I'll follow up on this with more details soon.
But that's not all. At the same time, we kept working on the GATK 3.x package in order to continue delivering improvements to the research community. Now we're just about ready to release version 3.6 -- nothing yuuugely different, but quite a few bug fixes and feature enhancements (especially in the GVCF workflow tools) that have been widely requested. Again, details to follow. Oh, and it supports Java 8! Which previous versions do not -- it may look like they do because they run on it without crashing, but there could be silent correctness errors. Which are the worst; I prefer a good honest in-your-face run-busting error any day.
So there you have it; version 3.6 coming out sometime next week (-ish), and GATK 4 coming out later this year, probably Fall timeframe. In between, we'll have one last 3.x release, either a patch release (3.6-x) or a proper minor release (3.7) depending on how substantial the changes turn out to be, to immortalize the last state of the classic GATK before it gets encased in amber.
Ever find yourself happily going about your day, using Picard and/or GATK, when all of a sudden a dreaded ERROR message is output instead of your tidy SAM/BAM or VCF file? Even worse, you try to diagnose the errors using Picard's ValidateSamFile tool and the command-line output is... shall we say, "incomplete"? Well, take comfort because we have just the right medicine for you... new documentation for Picard's ValidateSamFile tool!
This document will guide even the most novice of users through the steps of diagnosing problems lurking within their SAM/BAM files. Not only that, for the first time in the history of the world, we present tables listing and explaining all of the ERROR message outputs of this program! Woohoo!!! Now you can effectively troubleshoot your ERROR messages by running ValidateSamFile prior to feeding your SAM/BAM file into Picard and/or GATK.
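If you haven't run the tool before, the invocation looks something like this (the file names are placeholders, and the path to picard.jar depends on your installation):

```shell
# SUMMARY mode tallies how many of each ERROR/WARNING type were found --
# a good first pass to see what you're dealing with.
java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY

# VERBOSE mode (the default) reports each problem record individually,
# which is what you want once you know which error types to chase down.
java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=VERBOSE
```

The new documentation explains what each of the reported error types means and how to fix it.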
Can we all agree that this is 2016 and next-generation sequencing is really just sequencing at this point?
Seriously, I was in college when NGS was becoming a thing. In techno-geological terms, that was the Cretaceous. Yet over a decade later, this super-vague term is somehow still stuck in our collective consciousness.
I'm not the one to say what the real next generation of sequencing is -- maybe Nanopore and all that exotic long-read tech. My point is that calling the current generation of sequencing technology next-gen or NGS is embarrassingly retrograde and we should all stop*.
*I'm sure we have some old articles in our docs that use the term NGS, if you point them out to me I'll fix them.
Of course, there's still Sanger sequencing and we want to be able to tell the difference -- but really, isn't it Sanger that is the oddball now, and the rest is just regular sequencing? Well, if we must -- hey look we have a technically-accurate term, it's called high-throughput sequencing. It even comes with a reasonably snappy three-letter abbreviation to slap in titles and on posters where space is at a premium: HTS (putting the hts in htsjdk).
Alright, rant over. Until next time.
The slides are available here:
The company that hosts our forums will be performing some maintenance and upgrades on the servers tomorrow, Thursday May 14th. Based on the notification they sent us (included below), we should expect the forums to be unavailable for about 45 minutes at some point between 2pm and 8pm EST.
During that time, most of the cached GATK documentation will remain available through the Guide section of the GATK website. However, some pages may show error messages and some blog content may not be available, and of course during that time it will not be possible to ask or answer questions on the forum.
Thanks for your patience, and let's all look forward to improved reliability of our forum service.
Today, several members of our extended group are speaking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications, there's a press release, accompanied by a blog post on the Broad Institute blog, unveiling a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.
These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.
Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.
Alright, so what's happening exactly? Read on to find out!
In my last blog post, I introduced the Cromwell+WDL pipelining solution that we developed to make it easier to write and run sophisticated analysis pipelines on cloud infrastructure. I've also mentioned in the recent past that we're building the next generation of GATK (which will be GATK 4) to run efficiently on cloud-based analysis platforms.
So in this follow-up I want to explain why we care so much about building software that runs well in the cloud, which comes down to some key benefits of "The Cloud" over more traditional computing solutions like local servers and clusters (which I'll just refer to as clusters from now on -- the distinction is not really important here).
Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.
As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.
So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.
Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.
It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
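To give a sense of what "human-readable and -writable" means in practice, here's a minimal sketch of a WDL script (the task and workflow names are invented for illustration):

```wdl
# A task wraps a single command together with its inputs and outputs.
task greet {
  String name
  command {
    echo "Hello, ${name}!"
  }
  output {
    String message = read_string(stdout())
  }
}

# A workflow chains together calls to tasks.
workflow hello_world {
  call greet { input: name = "world" }
}
```

Cromwell takes care of actually executing a script like this, whether on your laptop or on a cloud backend.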