This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.
Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).
Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.
This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.
The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.
Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.
Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.
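To put rough numbers on that tradeoff, here is a toy model of the rice example. It is only a sketch: the 30 minutes per batch and the 5 minutes of setup per cooker are made-up values, and real overhead does not have to grow linearly with the number of workers.

```python
import math

def total_time(n_tasks, n_workers, task_time=1.0, overhead_per_worker=0.0):
    """Toy model: workers process tasks in rounds, and every worker adds a fixed setup cost."""
    rounds = math.ceil(n_tasks / n_workers)
    return rounds * task_time + n_workers * overhead_per_worker

# 64 people at 4 people per batch means 16 batches (30 minutes each is an assumption).
sequential = total_time(16, 1, task_time=30)      # one rice cooker
eight_cookers = total_time(16, 8, task_time=30)   # eight cookers, no overhead

# With 5 minutes of (hypothetical) setup per cooker, more cookers is not always better.
with_overhead = {n: total_time(16, n, task_time=30, overhead_per_worker=5)
                 for n in (1, 4, 8, 16)}
print(sequential, eight_cookers, with_overhead)
```

In this model eight cookers beat sixteen as soon as each cooker carries a setup cost, which is the same reason you want to avoid splitting a GATK run into too many tiny jobs.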
OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?
Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.
Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:
open the file, count the number of lines in the file, tell us the number, close the file
Here, "tell us the number" can mean writing it to the console, or storing it somewhere for use later on.
Now let's say we want to know the number of words on each line. The set of instructions would be:
open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number
And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.
So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:
open the file, index the lines
read the first line, count the number of words, tell us the number
read the second line, count the number of words, tell us the number
read the third line, count the number of words, tell us the number
[repeat for all lines]
collect final results and close the file
Here, the "read the Nth line" steps can be performed in parallel, because they are all independent operations.
You'll notice that we added a step, "index the lines". That's a little bit of preliminary work that allows us to perform the "read the Nth line" steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.
Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
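The word-counting example can be sketched in a few lines of Python. This is an illustration of the structure only (index first, independent tasks in the middle, collect at the end), not of how the GATK is implemented; the function names are made up for this example, and Python threads will not necessarily make such a small job faster.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def index_lines(path):
    """Preliminary step: record the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets

def count_words_at(path, offset):
    """Independent task: jump straight to one line and count its words."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return len(handle.readline().split())

def count_words_parallel(path, workers=4):
    offsets = index_lines(path)  # has to run first
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The middle tasks can run in any order; map() returns results in line order.
        return list(pool.map(lambda offset: count_words_at(path, offset), offsets))

# Small demonstration on a temporary four-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one two three\nfour five\n\nsix\n")
counts = count_words_parallel(tmp.name)
os.remove(tmp.name)
print(counts)  # [3, 2, 0, 1]
```

Because the index tells each task exactly where its line starts, no task has to wait for another one to finish reading.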
There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different levels of computing that are involved.
By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.
Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Greek terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.
Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.
Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.
Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.
In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.
Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.
If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.
Hey, if you have a better example, let us know in the forum and we'll use that instead.
Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?
There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:
-nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).
Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.
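As a small illustration, the helper below assembles a GATK command line and refuses to add a threading option that a tool does not support. It is a sketch under assumptions: the support table only covers the two tools mentioned above (as of Dec. 2012), the function and file names are hypothetical, and the jar path is a placeholder for your own installation.

```python
# Support table taken from the two examples in the text; extend it for other tools.
THREADING_SUPPORT = {
    "RealignerTargetCreator": {"nt": True, "nct": False},
    "IndelRealigner": {"nt": False, "nct": False},
}

def gatk_command(tool, reference, input_bam, nt=1, nct=1, jar="GenomeAnalysisTK.jar"):
    """Build the argument list for one GATK run, checking -nt / -nct support."""
    support = THREADING_SUPPORT[tool]
    if nt > 1 and not support["nt"]:
        raise ValueError(f"{tool} does not support -nt")
    if nct > 1 and not support["nct"]:
        raise ValueError(f"{tool} does not support -nct")
    command = ["java", "-jar", jar, "-T", tool, "-R", reference, "-I", input_bam]
    if nt > 1:
        command += ["-nt", str(nt)]
    if nct > 1:
        command += ["-nct", str(nct)]
    return command

# Placeholder file names, for illustration only.
print(" ".join(gatk_command("RealignerTargetCreator", "reference.fasta", "input.bam", nt=4)))
```

The returned list can be handed to a process launcher as-is; asking for -nt 4 with IndelRealigner would raise an error instead of producing a command that the GATK rejects.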
In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.
If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.
So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?
As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).
At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).
* Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.
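The scatter and gather steps can be sketched as follows. This is a conceptual illustration, not Queue's actual implementation: the chunking rule and the run_job stand-in are assumptions, and local threads play the role of the separate GATK jobs that Queue would hand to a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter(items, n_jobs):
    """Split the input into up to n_jobs contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks

def run_job(chunk):
    """Stand-in for one independent GATK command working on its portion of the data."""
    return [f"processed:{interval}" for interval in chunk]

def gather(partial_results):
    """Collate the per-job results, in order, into one final output."""
    return [row for part in partial_results for row in part]

intervals = ["chr1", "chr2", "chr3", "chr4", "chr5"]
chunks = scatter(intervals, 3)
with ThreadPoolExecutor() as pool:
    results = gather(pool.map(run_job, chunks))
print(chunks)
print(results)
```

The gathered output is identical to what a single run over all five intervals would produce, which is the property that makes a tool safe to scatter-gather.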
So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.
That being said, multithreading has the great advantage that cores and machines all have access to shared machine memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.
The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.
Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.
This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.
As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).
There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:
-nt / --num_threads controls the number of data threads sent to the processor
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread
For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.
Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory and you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
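The memory rule boils down to a one-line calculation. The helper below is a minimal sketch; the function name is made up for this example, and the result is only as good as your estimate of what a single run needs.

```python
def required_memory_gb(single_run_gb, nt=1, nct=1):
    """Memory to allocate for a multithreaded run.

    Each data thread (-nt) needs the full single-run amount. CPU threads (-nct)
    share the memory of their data thread, so nct is accepted but does not
    change the result.
    """
    return single_run_gb * nt

print(required_memory_gb(2, nt=4))          # 8
print(required_memory_gb(2, nt=4, nct=8))   # still 8
```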
-nct with versions 2.2 and 2.3
Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.
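Put as a calculation, the rule looks like the sketch below. It assumes the behaviour is exactly as described above (one manager thread in 2.2 and 2.3, none from 2.4 on); the function name and the way the version is passed in are choices made for this example.

```python
def working_cpu_threads(nct, version):
    """Number of CPU threads that actually do work for a given -nct value.

    `version` is a (major, minor) tuple, e.g. (2, 3) for GATK 2.3.
    """
    if nct <= 1:
        return 1                      # plain single-threaded run
    if (2, 2) <= version < (2, 4):
        return nct - 1                # one thread is reserved to manage the rest
    return nct

print(working_cpu_threads(2, (2, 3)))  # 1, so no speedup yet
print(working_cpu_threads(3, (2, 3)))  # 2
print(working_cpu_threads(3, (2, 4)))  # 3
```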
For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.
Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.
|Tool||Full name||Type of traversal||NT||NCT||SG|
The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.
|Cluster nodes||1||4||4||1||4||4 / 4 / 4|
|CPU threads (-nct)||1||1||8||4-8||1||3 / 6 / 24|
|Data threads (-nt)||24||1||1||1||1||8 / 4 / 1|
|Memory (Gb)||48||4||4||4||4||32 / 16 / 4|
Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.
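One way to read the three UnifiedGenotyper alternatives in the configuration table is to work out what they have in common. The sketch below assumes that the last column of the table is the UnifiedGenotyper one, and that the thread and memory values apply to each node; both are our reading of the table rather than something it states.

```python
# The three alternatives from the last column: (cluster nodes, -nct, -nt, memory in Gb).
ug_configs = [
    {"nodes": 4, "nct": 3, "nt": 8, "memory_gb": 32},
    {"nodes": 4, "nct": 6, "nt": 4, "memory_gb": 16},
    {"nodes": 4, "nct": 24, "nt": 1, "memory_gb": 4},
]

for config in ug_configs:
    # Every data thread gets its own set of CPU threads...
    config["threads"] = config["nt"] * config["nct"]
    # ...and its own full share of memory.
    config["gb_per_data_thread"] = config["memory_gb"] / config["nt"]
    print(config)
```

All three come out at 24 threads and 4 Gb per data thread, so under this reading they are different ways of spending the same CPU budget, with memory use dropping as data threads are traded for CPU threads.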
Scala is a combination of an object oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.
The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there are a multitude of resources available around the web including the Scala home page and the online Scala Doc.
var is a value you can later modify, while val is similar to final in Java.
Because the GATK and Queue are a mix of Scala and Java sometimes you'll run into problems when you need a Scala collection and instead a Java collection is returned.
MyQScript.scala:39: error: type mismatch;
 found   : java.util.List[java.lang.String]
 required: scala.List[String]
    val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)
Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.
Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.
var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)
:+ operator for a single value.
var myList = List.empty[String]
myList :+= "a"
myList :+= "b"
myList :+= "c"
++ for appending a list.
var myList = List.empty[String]
myList ++= List("a", "b", "c")

var mySet = Set.empty[String]
mySet += "a"
mySet += "b"
mySet += "c"

var myMap = Map.empty[String,Int]
myMap += "a" -> 1
myMap += "b" -> 2
myMap += "c" -> 3
Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.
var myNullableInt1: Option[Int] = Some(1)
var myNullableInt2: Option[Int] = None
François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded
To quote from his slides:
Give me a variable name but - I don't care of what it is - and/or - don't want to pollute my namespace with it
This Java snippet:
String formatted = String.format("%s %d", myString, myInt);
In Scala would be:
val formatted = "%s %d".format(myString, myInt)
No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if for example you are importing a Java based walker's
If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.
Test that Queue is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to Queue.jar> --help
Replace the <path to Queue.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>] [-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir <temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>] [-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo <status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]

-S,--script <script>  QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix>  Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue>  Default queue for compute farm jobs.
-jobProject,--job_project <job_project>  Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>  Default directory to place scatter gather output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit>  Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory>  Root directory to run functions from.
-tempDir,--temp_directory <temp_directory>  Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost>  Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort>  Email SMTP port. Defaults to 465 for ssl, otherwise 25.
-emailTLS,--emailUseTLS  Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL  Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername>  Email SMTP username. Defaults to none.
-emailPass,--emailPassword <emailPassword>  Email SMTP password. Defaults to none. Not secure! See emailPassFile.
-emailPassFile,--emailPasswordFile <emailPasswordFile>  Email SMTP password file. Defaults to none.
-bsub,--bsub_all_jobs  Use bsub to submit jobs
-run,--run_scripts  Run QScripts. Without this flag set only performs a dry run.
-dot,--dot_graph <dot_graph>  Outputs the queue graph to a .dot file. See: http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph>  Outputs the queue graph of scatter gather to a .dot file. Otherwise overwrites the dot_graph
-startFromScratch,--start_from_scratch  Runs all command line functions even if the outputs were previously output successfully.
-status,--status  Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from>  Email address to send emails from upon completion or on error.
-statusTo,--status_email_to <status_email_to>  Email address to send emails to upon completion or on error.
-keepIntermediates,--keep_intermediate_outputs  After a successful run keep the outputs of any Function marked as intermediate.
-retry,--retry_failed <retry_failed>  Retry the specified number of times after a command fails. Defaults to no retries.
-l,--logging_level <logging_level>  Set the minimum level of logging, i.e. setting INFO get's you INFO up to FATAL, setting ERROR gets you ERROR and FATAL level logging.
-log,--log_to_file <log_to_file>  Set the logging location
-quiet,--quiet_output_mode  Set the logging to quiet mode, no output to stdout
-debug,--debug_mode  Set the logging file string to include a lot of debugging information (SLOW!)
-h,--help  Generate this help message
If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less than 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
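If you want to script this check, for example before launching a long pipeline, the version test can be sketched as below. The function name is made up for this example, and the parsing only handles version strings shaped like the sample output above; anything it cannot parse is treated as a failure.

```python
import re

def java_version_ok(version_output, minimum=(1, 6)):
    """Return True if the text printed by the Java version check reports at least `minimum`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return False  # e.g. "java: Command not found"
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= minimum

sample = (
    'java version "1.6.0_12"\n'
    "Java(TM) SE Runtime Environment (build 1.6.0_12-b04)\n"
    "Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)"
)
print(java_version_ok(sample))                        # True
print(java_version_ok('java version "1.5.0_22"'))     # False
print(java_version_ok("java: Command not found"))     # False
```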
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
The processing of data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset or that other pipelines won't supersede them in the future.
The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.
First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (For older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools.)
To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.
Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.
Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.
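To see why the cutoff matters, recall that a Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base is wrong. The sketch below uses made-up quality values and candidate thresholds, not GATK defaults, to show how quickly bases drop out as the cutoff approaches the average quality.

```python
def phred_to_error_probability(q):
    """Standard Phred scale: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

def fraction_passing(qualities, threshold):
    """Fraction of bases whose quality is at or above the cutoff."""
    return sum(1 for q in qualities if q >= threshold) / len(qualities)

# Hypothetical base qualities for one read, averaging a little above Q20.
qualities = [12, 18, 20, 22, 25, 30]

print(phred_to_error_probability(20))
for threshold in (10, 20, 25):  # illustrative cutoffs only
    print(threshold, fraction_passing(qualities, threshold))
```

With these example values, raising the cutoff from 10 to 25 discards two thirds of the bases, which is the kind of effect to look for when choosing thresholds for your own data.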
To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what the processing steps are and what parameters you need to set/adjust.
The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.
Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).
This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.
Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.
|Argument (short-name)||Argument (long-name)||Description|
|-i <BAM file / BAM list>||--input <BAM file / BAM list>||input BAM file - or list of BAM files.|
|-R <fasta>||--reference <fasta>||Reference fasta file.|
|-D <vcf>||--dbsnp <dbsnp vcf>||dbsnp ROD to use (must be in VCF format).|
|Argument (short-name)||Argument (long-name)||Description|
|-indels <vcf>||--extra_indels <vcf>||VCF files to use as reference indels for Indel Realignment.|
|-bwa <path>||--path_to_bwa <path>||The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option)|
|-outputDir <path>||--output_directory <path>||Output path for the processed BAM files.|
|-L <GATK interval string>||--gatk_interval_string <GATK interval string>||the -L interval string to be used by GATK - output bams at interval only|
|-intervals <GATK interval file>||--gatk_interval_file <GATK interval file>||an intervals file to be used by GATK - output bams at intervals|
|Argument (short-name)||Argument (long-name)||Description|
|-p <name>||--project <name>||the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam|
|-knowns||--knowns_only||Perform cleaning on knowns only.|
|-sw||--use_smith_waterman||Perform cleaning using Smith Waterman|
|-bwase||--use_bwa_single_ended||Decompose input BAM file and fully realign it using BWA and assume Single Ended reads|
|-bwape||--use_bwa_pair_ended||Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads|
Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis-ready reads (bam file):
Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:
This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.
This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.
This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional Smith-Waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.
This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate
The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, and an analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.
The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.
We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.
Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.
PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.
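The naming scheme described in this section can be summarized with a small helper, which is handy for checking that a run produced everything you expect. This is a sketch based only on the names given in the text above; the function and the dictionary keys are made up for this example, and since the argument table quotes a slightly different example name for the final BAM, verify against the files your own run produces.

```python
def expected_outputs(project, samples):
    """List the output names described above for a project and its samples."""
    outputs = {"cohort_list": f"{project}.cohort.list", "samples": {}}
    for sample in samples:
        base = f"{project}.{sample}"
        outputs["samples"][sample] = {
            "bam": f"{base}.bam",
            "pre_validation": f"{base}.pre.validation",
            "post_validation": f"{base}.post.validation",
            "pre_plots_dir": f"{base}.pre",
            "post_plots_dir": f"{base}.post",
        }
    return outputs

# Hypothetical project and sample names.
outputs = expected_outputs("myFancyProjectName", ["sampleA", "sampleB"])
print(outputs["cohort_list"])
print(outputs["samples"]["sampleA"])
```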
Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)
java \
  -Xmx4g \
  -Djava.io.tmpdir=/path/to/tmpdir \
  -jar path/to/GATK/Queue.jar \
  -S path/to/DataProcessingPipeline.scala \
  -p myFancyProjectName \
  -i myDataSet.list \
  -R reference.fasta \
  -D dbSNP.vcf \
  -run
Performing realignment and the full data processing pipeline in one pair-ended bam file
java \
  -Xmx4g \
  -Djava.io.tmpdir=/path/to/tmpdir \
  -jar path/to/Queue.jar \
  -S path/to/DataProcessingPipeline.scala \
  -bwa path/to/bwa \
  -i test.bam \
  -R reference.fasta \
  -D dbSNP.vcf \
  -p myProjectWithRealignment \
  -bwape \
  -run
Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.
This workshop will cover the following topics:
The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!
Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.
This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!