## Job openings for a Data Scientist and an Associate Software Engineer

### Posted by Geraldine_VdAuwera on 29 May 2016 (0)

We've got so many exciting projects going on right now, between the GATK4 alpha, expanding our scope to include copy number and structural variation, and ramping up to offer GATK-as-a-service -- we're going to need more talent! And possibly a bigger boat.

So if you or someone you know (and like) are looking to join a team working on cutting-edge analysis methods and software, real big data and a mission that matters; if you like the idea of a stimulating professional environment with competitive compensation where equal opportunity and work-life balance are not empty phrases; and if your Memorial Day plans get canceled on account of rain -- why not take that time to polish up your resume and apply for one of the jobs below?

We look forward to hearing from you!

## Notes on differences in GATK 3.5 test results between Java 7 and 8

### Posted by Geraldine_VdAuwera on 20 May 2016 (0)

In my last blog post, I mentioned that GATK 3.6 would support Java 8. I also mentioned that we had some evidence from our Java migration testing that GATK 3.5 (and presumably older versions as well) may produce correctness errors if run on Java 8. Since then quite a few people have expressed concern because they have been running GATK 3.5 or older versions on Java 8 already. Most wanted to know what would be the nature and amplitude of these problems, and whether they should re-run the affected data on Java 7.

We don't have definitive answers for these questions because we haven't performed end-to-end testing of GATK 3.5 on Java 8. Once we noticed that some of the automated tests were failing when we switched Java versions, we hunted down the source of the test failures (some Java list structures for which iteration order is not the same between Java versions) and fixed them.
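To make the iteration-order issue concrete, here is a minimal sketch of how this class of problem can arise and be avoided. It is not taken from the GATK codebase: the class name, method names and allele strings are all hypothetical, and `HashSet` is used only as a well-known example of a collection whose iteration order the Java API leaves unspecified (the post does not say which structures were actually involved).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from GATK.
class IterationOrderSketch {

    // Fragile: the result follows HashSet iteration order, which is
    // unspecified and may differ from one Java version to another.
    static List<String> fragileOrder(List<String> alleles) {
        return new ArrayList<>(new HashSet<>(alleles));
    }

    // Stable: LinkedHashSet always iterates in insertion order.
    static List<String> insertionOrder(List<String> alleles) {
        return new ArrayList<>(new LinkedHashSet<>(alleles));
    }

    // Stable: an explicit sort removes any dependence on the collection type.
    static List<String> sortedOrder(List<String> alleles) {
        List<String> unique = new ArrayList<>(new HashSet<>(alleles));
        Collections.sort(unique);
        return unique;
    }

    public static void main(String[] args) {
        List<String> alleles = Arrays.asList("T", "ACG", "A", "T", "G");
        // Same elements every time, but no guarantee about their order.
        System.out.println("fragile:   " + fragileOrder(alleles));
        System.out.println("insertion: " + insertionOrder(alleles));
        System.out.println("sorted:    " + sortedOrder(alleles));
    }
}
```

Any downstream step that depends on the position of an element (for example, picking "the first" item) inherits whatever order the collection happens to produce, which is why a test can pass on one Java version and fail on another without any change to the code.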

So for us, the story stops there. But we understand that those of you who have been misguidedly running GATK on Java 8 need more information to decide what to do. I thought that perhaps sharing the relevant details from our migration test results might help, so I compiled a summary of per-tool tests that were affected, with some developer notes and values that were discussed in the issue ticket.

## Sneak preview of the GATK release calendar for 2016

### Posted by Geraldine_VdAuwera on 12 May 2016 (0)

Since the GATK first started gaining traction in the research community ca. 2010, its development has sustained a fairly rapid pace, with a new major version (1, 2 and 3 so far) coming out about every 2 years. Each major version was a straight continuation of the same codebase, distinguished by significantly new tools and capabilities (e.g. HaplotypeCaller in 2.0, the GVCF workflow in 3.0).

This year, we're on track for a new major version, but we're breaking the mold of the classic GATK codebase. Over the past 18 months, in parallel to the ongoing 3.x development effort, we built a brand new GATK engine that is faster, more scalable and can support new types of analysis that weren't possible in the original GATK framework. Now we're hard at work porting the classic GATK tools over to the new framework, as well as developing some new ones (copy number!). The resulting toolkit will be formally released as GATK 4 later this year. If you're keen to try it out, it's already available as an alpha preview; I'll follow up on this with more details soon.

But that's not all. At the same time, we kept working on the GATK 3.x package in order to continue delivering improvements to the research community. Now we're just about ready to release version 3.6 -- nothing yuuugely different, but quite a few bug fixes and feature enhancements (especially in the GVCF workflow tools) that have been widely requested. Again, details to follow. Oh, and it supports Java 8! Which previous versions do not -- it may look like they do because they run on it without crashing, but there could be silent correctness errors. Which are the worst; I prefer a good honest in-your-face run-busting error any day.

So there you have it; version 3.6 coming out sometime next week (-ish), and GATK 4 coming out later this year, probably Fall timeframe. In between, we'll have one last 3.x release, either a patch release (3.6-x) or a proper minor release (3.7) depending on how substantial the changes are, to immortalize the last state of the classic GATK before it gets encased in amber.

## The doctor is in: diagnosing sick SAM/BAM files

### Posted by dekling on 3 May 2016 (0)

Ever find yourself happily going about your day, using Picard and/or GATK when all of a sudden a dreaded ERROR message is output instead of your tidy SAM/BAM or VCF file? Even worse, you try to diagnose the errors using Picard's ValidateSamFile tool and the command-line output is... shall we say, "incomplete"? Well, take comfort because we have just the right medicine for you... new documentation for Picard's ValidateSamFile tool!

This document will guide even the most novice of users through the steps of diagnosing problems lurking within their SAM/BAM files. Not only that, for the first time in the history of the world, we present tables listing and explaining all of the WARNING and ERROR message outputs of this program! Woohoo!!! Now you can effectively troubleshoot your WARNING and ERROR messages by running ValidateSamFile prior to feeding your SAM/BAM file into Picard and/or GATK.

## Let's please stop calling it NGS

### Posted by Geraldine_VdAuwera on 27 Apr 2016 (5)

Can we all agree that this is 2016 and next-generation sequencing is really just sequencing at this point?

Seriously, I was in college when NGS was becoming a thing. In techno-geological terms, that was the Cretaceous. Yet over a decade later, this super-vague term is somehow still stuck in our collective consciousness.

I'm not the one to say what is the real next generation of sequencing, maybe Nanopore and all that exotic long-read tech. My point is that calling the current generation of sequencing technology next-gen or NGS is embarrassingly retrograde and we should all stop*.

*I'm sure we have some old articles in our docs that use the term NGS; if you point them out to me I'll fix them.

Of course, there's still Sanger sequencing and we want to be able to tell the difference -- but really, isn't it Sanger that is the oddball now, and the rest is just regular sequencing? Well, if we must -- hey look we have a technically-accurate term, it's called high-throughput sequencing. It even comes with a reasonably snappy three-letter abbreviation to slap in titles and on posters where space is at a premium: HTS (putting the hts in htsjdk).

Alright, rant over. Until next time.

## UK April 14-15 Workshop Slides

### Posted by Geraldine_VdAuwera on 14 Apr 2016 (0)

The slides are available here:

## Service note: expecting forum downtime on April 14th

### Posted by Geraldine_VdAuwera on 13 Apr 2016 (0)

The company that hosts our forums will be performing some maintenance and upgrades on the servers tomorrow, Thursday April 14th. Based on the notification they sent us (included below), we should expect the forums to be unavailable for about 45 minutes at some point between 2pm and 8pm EST.

During that time, most of the cached GATK documentation will remain available through the Guide section of the GATK website. However, some pages may show error messages and some blog content may not be available, and of course during that time it will not be possible to ask or answer questions on the forum.

Thanks for your patience, and let's all look forward to improved reliability of our forum service.

## Making GATK available on the cloud for everyone

### Posted by Geraldine_VdAuwera on 6 Apr 2016 (1)

Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications there's a press release accompanied by a blog post on the Broad Institute blog, which unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.

These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.

Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.

Alright, so what's happening exactly? Read on to find out!

### Posted by Geraldine_VdAuwera on 4 Apr 2016 (0)

In my last blog post, I introduced the Cromwell+WDL pipelining solution that we developed to make it easier to write and run sophisticated analysis pipelines on cloud infrastructure. I've also mentioned in the recent past that we're building the next generation of GATK (which will be GATK 4) to run efficiently on cloud-based analysis platforms.

So in this follow-up I want to explain why we care so much about building software that runs well in the cloud, which comes down to some key benefits of "The Cloud" over more traditional computing solutions like local servers and clusters (which I'll just refer to as clusters from now on -- the distinction is not really important here).

## The Art of the Pipeline: Introducing Cromwell + WDL

### Posted by Geraldine_VdAuwera on 31 Mar 2016 (1)

Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.

As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.

So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing Scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.

Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.

It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
