## The doctor is in: diagnosing sick SAM/BAM files

### Posted by dekling on 3 May 2016 (0)

Ever find yourself happily going about your day, using Picard and/or GATK, when all of a sudden a dreaded ERROR message is output instead of your tidy SAM/BAM or VCF file? Even worse, you try to diagnose the errors using Picard's ValidateSamFile tool and the command-line output is... shall we say, "incomplete"? Well, take comfort because we have just the right medicine for you... new documentation for Picard's ValidateSamFile tool!

This document will guide even the most novice of users through the steps of diagnosing problems lurking within their SAM/BAM files. Not only that, for the first time in the history of the world, we present tables listing and explaining all of the WARNING and ERROR message outputs of this program! Woohoo!!! Now you can effectively troubleshoot your WARNING and ERROR messages by running ValidateSamFile prior to feeding your SAM/BAM file into Picard and/or GATK.
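If you want a quick triage before digging into the full documentation, a first pass is to run ValidateSamFile in SUMMARY mode and tally the error types. Here is a minimal sketch of that idea in Python; the `picard.jar` path and file names are placeholders, and the line pattern assumes the `ERROR:TYPE<tab>count` layout of the SUMMARY histogram, so check it against your own Picard version's output:

```python
import re
import subprocess
from collections import Counter


def parse_validation_summary(text):
    """Tally ERROR:/WARNING: entries from ValidateSamFile SUMMARY output.

    Assumes lines of the form 'ERROR:MATE_NOT_FOUND<tab>77'; adjust the
    pattern if your Picard version formats the histogram differently.
    """
    counts = Counter()
    for line in text.splitlines():
        m = re.match(r"(ERROR|WARNING):(\S+)\s+(\d+)", line.strip())
        if m:
            counts[(m.group(1), m.group(2))] = int(m.group(3))
    return counts


# Example invocation (paths are placeholders for your own setup):
# result = subprocess.run(
#     ["java", "-jar", "picard.jar", "ValidateSamFile",
#      "I=input.bam", "MODE=SUMMARY"],
#     capture_output=True, text=True)
# print(parse_validation_summary(result.stdout))
```

With the counts in hand, you can look up each error type in the new WARNING/ERROR tables and decide what to fix first.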

## Let's please stop calling it NGS

### Posted by Geraldine_VdAuwera on 27 Apr 2016 (5)

Can we all agree that this is 2016 and next-generation sequencing is really just sequencing at this point?

Seriously, I was in college when NGS was becoming a thing. In techno-geological terms, that was the Cretaceous. Yet over a decade later, this super-vague term is somehow still stuck in our collective consciousness.

I'm not the one to say what the real next generation of sequencing is; maybe it's Nanopore and all that exotic long-read tech. My point is that calling the current generation of sequencing technology next-gen or NGS is embarrassingly retrograde and we should all stop*.

*I'm sure we have some old articles in our docs that use the term NGS, if you point them out to me I'll fix them.

Of course, there's still Sanger sequencing and we want to be able to tell the difference -- but really, isn't it Sanger that is the oddball now, and the rest is just regular sequencing? Well, if we must -- hey look we have a technically-accurate term, it's called high-throughput sequencing. It even comes with a reasonably snappy three-letter abbreviation to slap in titles and on posters where space is at a premium: HTS (putting the hts in htsjdk).

Alright, rant over. Until next time.

## UK April 14-15 Workshop Slides

### Posted by Geraldine_VdAuwera on 14 Apr 2016 (0)

The slides are available here:

## Service note: expecting forum downtime on May 14th

### Posted by Geraldine_VdAuwera on 13 Apr 2016 (0)

The company that hosts our forums will be performing some maintenance and upgrades on the servers tomorrow, Thursday May 14th. Based on the notification they sent us (included below), we should expect the forums to be unavailable for about 45 minutes at some point between 2pm and 8pm EST.

During that time, most of the cached GATK documentation will remain available through the Guide section of the GATK website. However, some pages may show error messages and some blog content may not be available, and of course during that time it will not be possible to ask or answer questions on the forum.

Thanks for your patience, and let's all look forward to improved reliability of our forum service.

## Making GATK available on the cloud for everyone

### Posted by Geraldine_VdAuwera on 6 Apr 2016 (1)

Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications there's a press release accompanied by a blog post on the Broad Institute blog, which unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.

These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.

Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.

Alright, so what's happening exactly? Read on to find out!

### Posted by Geraldine_VdAuwera on 4 Apr 2016 (0)

In my last blog post, I introduced the Cromwell+WDL pipelining solution that we developed to make it easier to write and run sophisticated analysis pipelines on cloud infrastructure. I've also mentioned in the recent past that we're building the next generation of GATK (which will be GATK 4) to run efficiently on cloud-based analysis platforms.

So in this follow-up I want to explain why we care so much about building software that runs well in the cloud, which comes down to some key benefits of "The Cloud" over more traditional computing solutions like local servers and clusters (which I'll just refer to as clusters from now on -- the distinction is not really important here).

## The Art of the Pipeline: Introducing Cromwell + WDL

### Posted by Geraldine_VdAuwera on 31 Mar 2016 (1)

Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.

As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.

So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.

Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.

It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
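To give a flavor of what "human-readable and -writable" means in practice, here is a toy workflow in draft-era WDL syntax. The task and variable names are invented for illustration; consult the WDL spec for the authoritative syntax:

```wdl
task count_lines {
  File infile
  command {
    wc -l < ${infile}
  }
  output {
    Int n = read_int(stdout())
  }
}

workflow line_count {
  File data
  call count_lines { input: infile = data }
}
```

A task bundles a command with its inputs and outputs; a workflow wires tasks together, and Cromwell works out the execution order and dependencies from the input/output declarations. You can then run the script with Cromwell (jar name and inputs file are placeholders): `java -jar cromwell.jar run line_count.wdl inputs.json`.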

## Metrics say percent – Double-check those decimals – Fractions everywhere

### Posted by shlee on 18 Mar 2016 (0)

Picard metrics that say 'percent' actually mean 'fraction'. Let's take metrics from MarkDuplicates as an example. Under PERCENT_DUPLICATION we see 0.134008. If we divide READ_PAIR_DUPLICATES by READ_PAIRS_EXAMINED we get roughly 1/7, or 14%. Our sanity check makes clear that the PERCENT_DUPLICATION metric is a fraction that translates to 13.4%.
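The sanity check above boils down to one division. Here it is as a quick calculation; the metric values are illustrative stand-ins, not from a real MarkDuplicates run, and this uses the simplified read-pair ratio rather than Picard's full formula:

```python
# Sanity check: Picard's PERCENT_DUPLICATION is a fraction, not a percent.
# These counts are made up to yield a value near the 0.134008 in the post.
read_pair_duplicates = 1_000_000
read_pairs_examined = 7_462_240

fraction = read_pair_duplicates / read_pairs_examined
print(f"PERCENT_DUPLICATION reported as: {fraction:.6f}")  # a fraction near 0.134
print(f"as an actual percentage: {fraction * 100:.1f}%")   # ~13.4%
```

So whenever a Picard metric named PERCENT_* looks suspiciously small, multiply by 100 before reporting it as a percentage.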

## How long does it take to run the GATK Best Practices?

### Posted by Geraldine_VdAuwera on 16 Mar 2016 (0)

When you're setting up a variant discovery pipeline, you face two problems: deciding which tools to run (and with what options), and figuring out how to run them efficiently so that it doesn't take forever. Between our documentation and our support forum, we can get you most if not all the way to solving the first problem, unless you're working with something really unusual.

However, the second problem is not something we've been able to help much with. We only benchmark computational requirements/performance for the purposes of our in-house pipelines, which are very specific to our particular infrastructure, and we don't have the resources to test different configurations. As a result it's been hard for us to give satisfying answers to questions like "How long should this take?" or "How much RAM do I need?" -- and we're aware this is a big point of pain.

So I'm really pleased to announce that a team of engineers at Intel have been developing a system to profile pipelines that implement our Best Practices workflows on a range of hardware configurations. This is a project we've been supporting by providing test data and implementation advice, and it's really gratifying to see it bear fruit: the team recently published their first round of profiling, done on the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF) on a trio of whole genomes.

The white paper is available from the GATK-specific page of Intel's Health-IT initiative website and contains some very useful insights regarding key bottlenecks in the pipeline. It also details the applicability of parallelizing options for each tool, as well as the effect of using different numbers of threads on performance, when run on a single 36-core machine. Spoiler alert: more isn't always better!

Read on for a couple of highlights of what I thought was especially interesting in the Intel team's white paper.

## Workshop presentation slides and tutorial materials - UCLA 2016

### Posted by Geraldine_VdAuwera on 4 Mar 2016 (0)
