Tagged with #workflow
2 documentation articles | 1 announcement | 4 forum discussions



Created 2015-08-14 01:57:37 | Updated 2015-08-14 01:58:03 | Tags: best-practices workshop workflow presentations

Comments (0)

Joel Thibault, Valentin Ruano-Rubio and Geraldine Van der Auwera presented this workshop in Edinburgh, Scotland, and Cambridge, England, upon invitation from the Universities of Edinburgh and Cambridge.

This workshop included two modules:

  • Best Practices for Variant Calling with the GATK

    The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

  • Beyond the Best Practices

    Additional considerations such as calling variants in RNAseq data and calling cohorts efficiently, as well as dealing with non-human data, whole-genome vs. exome differences, basic quality control, and performance.

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Created 2015-08-14 01:50:16 | Updated 2015-08-14 01:56:56 | Tags: best-practices workshop workflow presentations

Comments (6)

The full GATK team presented this workshop at the Broad Institute with support from the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Created 2016-03-31 17:40:04 | Updated 2016-04-11 18:30:23 | Tags: workflow pipeline topstory cromwell wdl

Comments (1)

Today I'm delighted to introduce WDL, pronounced "widdle", a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.

As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submits them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.

So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing Scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.

Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.

It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!


In a nutshell, WDL was designed to make it easy to write scripts that describe analysis tasks and chain those tasks into workflows, with built-in support for advanced features like scatter-gather parallelism. This came at the cost of some additional headaches for the engineering team when they built Cromwell, the execution engine that runs WDL scripts -- but that was a conscious design decision, to push off the complexity onto the engineers rather than leave it in the lap of pipeline authors.
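
To give a rough idea of what this looks like in practice, here is a minimal sketch of a WDL script (the task, variable, and file names are hypothetical, not taken from any real pipeline): a single task is scattered over an array of input files, and Cromwell gathers the per-shard outputs automatically.

```wdl
# Hypothetical example: count the lines in each of several files in parallel.
task countLines {
  File in_file

  command {
    wc -l < ${in_file} > count.txt
  }
  output {
    Int line_count = read_int("count.txt")
  }
}

workflow countAll {
  Array[File] files

  # Each iteration of the scatter block runs as an independent shard;
  # countLines.line_count is implicitly collected into an Array[Int].
  scatter (f in files) {
    call countLines { input: in_file = f }
  }
}
```

Note that the scatter block is just a few lines of declarative syntax; the pipeline author never has to write any of the job-submission or result-gathering machinery, which is exactly the complexity that got pushed onto Cromwell.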

You can learn more about WDL at https://software.broadinstitute.org/wdl/, a website we put together to provide detailed documentation on how to write WDL scripts and run them on Cromwell. It's further enriched by a number of tutorials and real GATK-focused workflows provided as examples, and in the near future we'll also share our GATK Best Practices production pipeline scripts.

Yes, I just said that: we're going to be sharing full-on production scripts! In the past we refused to share scripts because they weren't portable across systems and they were difficult to read, so we were concerned that sharing them would generate more support burden than we could handle. But now that we have a solution that is both portable and user-friendly, we're not so worried -- and we're very excited about this as an opportunity to improve reproducibility among those who adopt our Best Practices in their work.


Of course it won't be the first time someone comes up with a revolutionary way to do things that ends up not working all that well in the real world. To that I say -- have you ever heard of the phrase "eating your own dog food", or "dogfooding" for short? It refers to a practice in software engineering where developers use their own product for running critical systems. I'm happy to say that we are now dogfooding Cromwell and WDL in our own production pipelines; a substantial part of the Broad's own analysis work, which involves processing terabytes worth of whole genome sequencing data, is now done on the cloud using Cromwell to execute pipelines written entirely in WDL.

That's how we know it works at scale, it's solid, and it does what we need it to do. That being said, there are a few additional features that are defined in WDL but not yet fully implemented in Cromwell, as well as a wishlist of convenience functions requested by our own methods development team, which the engineering team will pursue over the coming months. So its capabilities will keep growing as we add convenience functions to address our own and the wider community's pipelining needs.

Finally, I want to reiterate that WDL itself is meant to be very accessible to people who are not engineers or programmers; our own methods development team and various scientific collaborators have all commented that it's very easy to write and understand. So please try it out, and let us know what you think in the WDL forum!


Created 2015-07-21 22:33:32 | Updated 2015-07-21 22:33:53 | Tags: workflow pipeline exome cancer

Comments (1)

I am new to Bioinformatics, and would like some advice on changes to the GATK workflow for cancer. I was told that the cancer workflow is different, and see that several different tools are available.

I have Exome data from tumor and normal. I have aligned them, and have BAM files for each sample. I am interested in identifying somatic variants.

The current workflow as I understand it is:

  • (Non-GATK) Picard MarkDuplicates or Samtools rmdup
  • Indel Realignment (RealignerTargetCreator + IndelRealigner)
  • Base Quality Score Recalibration (BaseRecalibrator + PrintReads)
  • UnifiedGenotyper
  • Annotation using Oncotator (?)
  • MuTect (identify somatic mutations)

My questions are:

  1. Is the above workflow reasonable/correct for what I'm trying to do?
  2. Is there any difference running samples one pair at a time, or running them all together? (I have 57 pairs. Should I do 57 runs of normal-tumor pairs, or 1 run of all 57 pairs?)

Thank you, Gaius


Created 2014-02-12 20:31:19 | Updated | Tags: variantrecalibrator workflow multi-sample variant-calling

Comments (1)

Hi there,

We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.

We seem to have too few indels to do VQSR on them and have a couple of questions:

  1. Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?

  2. If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?

  3. Would joint or batch variant calling across several samples help us in this case?

Thanks in advance!


Created 2013-10-30 08:31:12 | Updated | Tags: license workflow graphics permission

Comments (3)

I have used GATK in my PhD project, and was wondering if I could get the permission to use the Best Practices workflow graphics [1] in my doctoral dissertation? How should I attribute your copyright?

[1] http://www.broadinstitute.org/gatk/img/BP_workflow.png


Created 2013-06-12 16:09:40 | Updated | Tags: commandlinegatk workflow rnaseq

Comments (15)

Hi all: I find that among all the workflows of GATK http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows there are no workflows for RNA-seq analysis. I understand that GATK mainly focuses on variant calling. Can anyone tell me how to use GATK for RNA-seq analysis?

thanks daniel