# Tagged with #dataprocessingpipeline

### Introduction

Processing data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on several datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than those obtained using alternative pipelines to date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset, or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

### BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For suggestions on which parameters to use, check the PacBio-specific recommendations given by the BWA author in the BWA manual page. The goal is to account for the Pacific Biosciences RS platform's known error modes and to benefit from the long reads for a high-scoring overall match. (For older versions, you can use the filtered_subreads.fasta file and combine it with the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline Python tools.)
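As a concrete illustration, such an alignment might look like the sketch below. This is a hedged example, not the document's exact command: the penalty values follow the PacBio suggestion in the BWA manual of that era, and the file names are placeholders, so verify both against your BWA version.

```shell
# Hedged sketch: option values follow the BWA manual's old PacBio suggestion;
# reference.fasta and filtered_subreads.fq are placeholder names.
bwa index reference.fasta                # one-time index of the reference

# bwasw with relaxed gap open (-q) relative to the mismatch penalty (-b)
bwa bwasw -b 5 -q 2 -r 1 -z 10 \
    reference.fasta filtered_subreads.fq > aligned.sam
```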

To produce a BAM file that is sorted by coordinate and carries adequate read group information, we use the Picard tools SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.
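The two Picard steps above can be sketched as follows. This uses the classic one-jar-per-tool Picard invocation style of that era; jar paths and read group fields are placeholders to adapt to your data.

```shell
# Hedged sketch: sort by coordinate, then attach read group information.
java -jar SortSam.jar \
    INPUT=aligned.sam OUTPUT=sorted.bam SORT_ORDER=coordinate

# RG fields below are placeholder values, not recommendations.
java -jar AddOrReplaceReadGroups.jar \
    INPUT=sorted.bam OUTPUT=sorted.rg.bam \
    RGID=run1 RGSM=sample1 RGLB=lib1 RGPL=PACBIO RGPU=unit1
```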

### Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. the latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.

### Problems with Variant Calling with Pacific Biosciences

• Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

• Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data averages Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here: choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold, and adjust as necessary.
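For example, a lowered base quality cutoff might be passed to the Unified Genotyper as sketched below. The -mbq (--min_base_quality_score) flag existed in GATK-era UnifiedGenotyper; the value 15 is only an illustrative starting point to evaluate, not a recommendation, and the file names are placeholders.

```shell
# Hedged sketch: lower the base quality cutoff so average-Q20 data
# is not discarded wholesale; evaluate the resulting calls and adjust.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I sample.recal.bam \
    -mbq 15 \
    -o calls.vcf
```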

• Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this has the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately, reference bias will mask real calls and you will have to inspect these by hand.

### Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how QScripts work, and read the Best Practices documentation to understand what the processing steps are and which parameters you need to set or adjust.

## Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

### Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).

### Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

### Command-line arguments

#### Required Parameters

| Argument (short name) | Argument (long name) | Description |
| --- | --- | --- |
| -i <BAM file / BAM list> | --input <BAM file / BAM list> | Input BAM file, or list of BAM files. |
| -R <fasta> | --reference <fasta> | Reference fasta file. |
| -D <vcf> | --dbsnp <dbsnp vcf> | dbsnp ROD to use (must be in VCF format). |

#### Optional Parameters

| Argument (short name) | Argument (long name) | Description |
| --- | --- | --- |
| -indels <vcf> | --extra_indels <vcf> | VCF files to use as reference indels for Indel Realignment. |
| -bwa <path> | --path_to_bwa <path> | The path to the bwa binary (usually BAM files have already been mapped, but use this option if you want to remap). |
| -outputDir <path> | --output_directory <path> | Output path for the processed BAM files. |
| -L <GATK interval string> | --gatk_interval_string <GATK interval string> | The -L interval string to be used by GATK; output BAMs at this interval only. |
| -intervals <GATK interval file> | --gatk_interval_file <GATK interval file> | An intervals file to be used by GATK; output BAMs at these intervals only. |

#### Modes of Operation (also optional parameters)

| Argument (short name) | Argument (long name) | Description |
| --- | --- | --- |
| -p <name> | --project <name> | The project name determines the final output (BAM file) base name. Example: NA12878 yields NA12878.processed.bam. |
| -knowns | --knowns_only | Perform cleaning on knowns only. |
| -sw | --use_smith_waterman | Perform cleaning using Smith-Waterman. |
| -bwase | --use_bwa_single_ended | Decompose the input BAM file and fully realign it using BWA, assuming single-ended reads. |
| -bwape | --use_bwa_pair_ended | Decompose the input BAM file and fully realign it using BWA, assuming paired-end reads. |

## The Pipeline

Data processing pipeline of the best practices for raw data processing, from sequencer data (FASTQ files) to analysis-ready reads (BAM file):

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

### BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.

### Sample Level Processing

This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

#### Indel Realignment

This is a two-step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional Smith-Waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.
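The two steps above, run by hand outside Queue, might look like this sketch. File names are placeholders, and the exact flags should be checked against your GATK version.

```shell
# Hedged sketch. Step 1: pick the intervals worth realigning
# (known indel sites plus indels observed in the data).
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R reference.fasta -I sample.bam -known indels.vcf \
    -o targets.intervals

# Step 2: realign reads only within those intervals; this step also
# fixes mate pair information for the reads it moves.
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R reference.fasta -I sample.bam \
    -targetIntervals targets.intervals -o sample.realigned.bam
```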

#### Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate and Dinucleotide Covariate.
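Run manually, recalibration in the GATK 2.x era this document discusses was a two-pass affair, sketched below. File names are placeholders, and the exact covariate names changed between GATK versions, so treat this as an outline rather than the pipeline's literal commands.

```shell
# Hedged sketch. Pass 1: model the error covariates against known sites.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta -I sample.realigned.bam \
    -knownSites dbsnp.vcf -o recal.table

# Pass 2: write a new BAM with the recalibrated quality scores applied.
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R reference.fasta -I sample.realigned.bam \
    -BQSR recal.table -o sample.recal.bam
```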

### The Outputs

The Data Processing Pipeline produces three types of output for each file: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of base qualities before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

#### Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.
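The naming convention above can be expressed as a tiny shell computation (purely illustrative; the project and sample values are placeholders):

```shell
# Illustrative only: derive output names from the <project>.<sample>.bam
# and <project>.cohort.list convention described above.
project=myFancyProjectName
sample=NA12878

cohort_list="${project}.cohort.list"
sample_bam="${project}.${sample}.bam"

echo "$cohort_list"   # myFancyProjectName.cohort.list
echo "$sample_bam"    # myFancyProjectName.NA12878.bam
```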

#### Validation Files

We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't require that the input BAM pass all validation, but we provide the validation log as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still complete successfully. The validation is a strict report on the state of your BAM file. Some errors are not critical, and the output files (both pre.validation and post.validation) should give you some insight into how to make your dataset better organized in the BAM format.
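The same check can be run by hand, as in this hedged sketch; the jar name and location vary by Picard release, and the file names are placeholders.

```shell
# Hedged sketch: validate a BAM the way the pipeline does, capturing
# the full per-record report rather than just a summary.
java -jar ValidateSamFile.jar \
    INPUT=sample.bam MODE=VERBOSE > sample.pre.validation
```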

#### Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis of the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named <project name>.<sample name>.pre and <project name>.<sample name>.post.

### Examples

1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

```
java \
    -Xmx4g \
    -Djava.io.tmpdir=/path/to/tmpdir \
    -jar path/to/GATK/Queue.jar \
    -S path/to/DataProcessingPipeline.scala \
    -p myFancyProjectName \
    -i myDataSet.list \
    -R reference.fasta \
    -D dbSNP.vcf \
    -run
```

2. Performing realignment and the full data processing pipeline on one paired-end BAM file

```
java \
    -Xmx4g \
    -Djava.io.tmpdir=/path/to/tmpdir \
    -jar path/to/Queue.jar \
    -S path/to/DataProcessingPipeline.scala \
    -bwa path/to/bwa \
    -i test.bam \
    -R reference.fasta \
    -D dbSNP.vcf \
    -p myProjectWithRealignment \
    -bwape \
    -run
```


My lab is trying to use Queue, but our pipeline is spawning a very large number of jobs (as would be expected). Our lab has very spiky data processing requirements and it would be useful to be able to limit the number of jobs Queue submits at a time, so that other, non-Queue jobs can run in parallel. Is this possible?

Thanks

Basically, I am using the DataProcessingPipeline, and interestingly, given an unaligned BAM file of 6 GB, the DataProcessingPipeline sometimes runs all the steps of the pipeline but produces a very small output (200K), or, most of the time, doesn't produce anything. The log file is as follows:

```
INFO 11:57:55,892 QScriptManager - Compiling 1 QScript
INFO 11:58:08,656 QScriptManager - Compilation complete
INFO 11:58:09,062 HelpFormatter - ----------------------------------------------------------------------
INFO 11:58:09,062 HelpFormatter - Queue v2.4-3-g2a7af43, Compiled 2013/02/27 12:20:10
INFO 11:58:09,062 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:58:09,063 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:58:09,063 HelpFormatter - Program Args: -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/xc1094WELa.49487._new.bam -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf -p c1094WELa.49487 -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ -bwape -run
INFO 11:58:09,063 HelpFormatter - Date/Time: 2013/03/04 11:58:09
INFO 11:58:09,063 HelpFormatter - ----------------------------------------------------------------------
INFO 11:58:09,063 HelpFormatter - ----------------------------------------------------------------------
INFO 11:58:09,071 QCommandLine - Scripting DataProcessingPipeline
INFO 11:58:09,314 QCommandLine - Added 13 functions
INFO 11:58:09,314 QGraph - Generating graph.
INFO 11:58:09,342 QGraph - Running jobs.
INFO 11:58:09,358 QGraph - 0 Pend, 0 Run, 0 Fail, 13 Done
INFO 11:58:09,409 QCommandLine - Writing final jobs report...
INFO 11:58:09,410 QJobsReporter - Writing JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt
INFO 11:58:09,432 QJobsReporter - Plotting JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf
WARN 11:58:09,452 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf
INFO 11:58:09,456 QCommandLine - Script completed successfully with 13 total jobs
```

and the command I am using,

```
java -Xmx4g -Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp/ \
    -jar /medpop/mpg-psrl/Parabase/Tools/Queue-2.4-3-g2a7af43/Queue.jar \
    -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala \
    -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa \
    -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487._new.bam \
    -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta \
    -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf \
    -p c1094WELa.49487 \
    -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ \
    -bwape -run > /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487.gatkpipline.log
```

What could be wrong here, since it doesn't really complain about anything? Thanks,

I am using Queue v2.3-5 and GenomeAnalysisTK-2.3-6-gebbba25. For a couple of my BAM files I have encountered the following error, and I was wondering what could be the reason...

```
ERROR 16:14:17,964 FunctionEdge - Error: hg19/ucsc.hg19.fasta /c1193JYUa.49483._new.1.2.sai /c1193JYUa.49483._new.reverted.bam /c1193JYUa.49483._new.reverted.bam > /c1193JYUa.49483._new.1.realigned.sam
ERROR 16:14:18,264 FunctionEdge - Contents of /medpop/mpg-psrl/Parabase/Projects/Seattle/c1193JYUa.49483._new.1.realigned.sam.out:
```


My command is as follows:

```
java -Xmx4g -Djava.io.tmpdir=/tmp/ \
    -jar /Tools/Queue-2.3-5-g49ed93c/Queue.jar \
    -S /Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala \
    -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa \
    -i /c1193JYUa.49483/c1193JYUa.49483._new.bam \
    -R /hg19/ucsc.hg19.fasta \
    -D hg19/dbsnp_137.hg19.vcf \
    -p /c1193JYUa.49483/ \
    -bwape -run > /c1193JYUa.49483/c1193JYUa.49483.gatkpipline.log
```

I have shortened the path for the sake of clarity ... thank you

I am facing a "fatal error by the Java Runtime Environment" after using the GATK DataProcessingPipeline; my java version is

> java version "1.6.0_35"
> Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)
>


and I am using GATK 2.3-6-gebbba25. The pipeline spat out the following error:

> lelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp'  '-cp' '/medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'RealignerTargetCreator'  '-I' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19._new.1.realigned.rg.bam'  '-R' '/medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta'  '-o' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals'  '-known' '/medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf'  '-mismatch' '0.0'
> INFO  11:32:30,896 FunctionEdge - Output written to /medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals.out
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00002ae7a5a50380, pid=562, tid=47174397446800
> #
> # JRE version: 6.0_35-b10
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.10-b01 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x712380]  SR_handler(int, siginfo*, ucontext*)+0x30
> #
> # An error report file with more information is saved as:
> # /medpop/mpg-psrl/Parabase/MyRuns/hs_err_pid562.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>


The log file is lengthy, but if you would like to have a look I can paste it here later on. Thanks for your support,

Is there any rule of thumb for allocating memory through "bsub" for running the DataProcessingPipeline, per BAM file or per number of reads?

Thanks

I am running the following command to test my Scala script using GATK-2.3.5:

> java -Xmx4g -Djava.io.tmpdir=tmp -jar /Queue-2.3-5-g49ed93c/Queue.jar -S Queue-2.3-5-g49ed93c/ExampleCountReads.scala -R /GATK-2.3-5-g49ed93c/resources/exampleFASTA.fasta -I /GATK-2.3-5-g49ed93c/resources/exampleBAM.bam

and I am getting this error:

> INFO  13:42:55,166 QScriptManager - Compiling 1 QScript
> ERROR 13:42:55,348 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of 'G'
> ERROR 13:42:55,356 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,356 QScriptManager -                           ^
> ERROR 13:42:55,357 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected
> ERROR 13:42:55,358 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,358 QScriptManager -                            ^
> ERROR 13:42:55,358 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected
> ERROR 13:42:55,359 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,360 QScriptManager -                             ^
> ERROR 13:42:55,360 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of '>'
> ERROR 13:42:55,361 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,362 QScriptManager -                                               ^
> ERROR 13:42:55,362 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected
> ERROR 13:42:55,363 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,365 QScriptManager -                                                 ^
> ERROR 13:42:55,366 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected
> ERROR 13:42:55,367 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,367 QScriptManager -                                                  ^
> ERROR 13:42:55,367 QScriptManager - ExampleCountReads.scala:39: in XML literal: '>' expected instead of ' '
> ERROR 13:42:55,369 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,369 QScriptManager -                                                   ^
> ERROR 13:42:55,369 QScriptManager - ExampleCountReads.scala:67: in XML literal: in XML content, please use '}}' to express '}'
> ERROR 13:42:55,369 QScriptManager -   }
> ERROR 13:42:55,369 QScriptManager -   ^
> ERROR 13:42:55,370 QScriptManager - ExampleCountReads.scala:39:  I encountered a '}' where I didn't expect one, maybe this tag isn't closed <WalkerName>
> ERROR 13:42:55,371 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,371 QScriptManager -                                                     ^
> ERROR 13:42:55,371 QScriptManager - ExampleCountReads.scala:68: '}' expected but eof found.
> ERROR 13:42:55,371 QScriptManager - }
> ERROR 13:42:55,372 QScriptManager -  ^
> ERROR 13:42:55,390 QScriptManager - 10 errors found
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR stack trace
>   at org.broadinstitute.sting.queue.QCommandLine.org$broadinstitute$sting$queue$QCommandLineqScriptPluginManager(QCommandLine.scala:94)
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-5-g49ed93c):
> ##### ERROR
> ##### ERROR Please visit the wiki to see if this is a known problem
> ##### ERROR If not, please post the error, with stack trace, to the GATK forum
> ##### ERROR
> ##### ERROR MESSAGE: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> ##### ERROR ------------------------------------------------------------------------------------------
> INFO  13:42:55,487 QCommandLine - Shutting down jobs. Please wait...
> 13.462u 1.029s 0:10.41 139.0% 0+0k 0+0io 1pf+0w
>


I am sure java, GATK and Queue are installed properly and the files exist in the relevant directory. Thank you,

Dear all, I am using Queue and DataProcessingPipeline.scala (from https://github.com/broadgsa/gatk/blob/master/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala) to process my BAM file. The input is a sample-level BAM file which has been aligned with BWA, processed with Samtools sampe and Picard merge, and duplicate-removed. The output BAM file (~200GB/sample) is much larger than the input BAM file (~80GB/sample). I want to know what information was added to the BAM file? Thanks a lot. My Queue version is Queue-2.1-10. My script:

```
java \
    -Xmx4g \
    -Djava.io.tmpdir=../tmp/ \
    -jar ./Queue-2.1-10/Queue.jar \
    -S ./DataProcessingPipeline.scala \
    -i input.bam \
    -R /db/human_g1k_v37.fasta \
    -D /db//dbSnp_b137.vcf \
    -run
```

Hi there, I was trying to debug an error in the R script generated after base recalibration, while running DataProcessingPipeline.scala (run as it is). I get the following debug output:

```
[...]
Error in file(filename, "r", blocking = TRUE) :
  cannot open the connection
Calls: source ... eval.with.vis -> eval.with.vis -> gsa.read.gatkreport -> file
1: In file(filename, "r", blocking = TRUE) :
  cannot open file '/SAN/scratch3/sample378_TTAGGC_L004_R1_001.fastq.pre_recal.table.recal': No such file or directory
Execution halted
```


No file ending with "recal.table.recal" exists, but the file "recal.table" does exist. I couldn't find any step in the Scala script where a ".recal" suffix is added to "recal.table", nor a specific trait or class referring to the R script itself; as I understand, it's part of the BaseRecalibrator walker.

Is this a small bug in the name handling, or am I doing something wrong somewhere?

thanks, Francesco

I had a question that, while it might be more appropriate for a BWA or seqanswers audience, I noticed something in the GATK's "Data Processing Pipeline" under "Methods and Workflows" that made me wonder. The pipeline is described here and there's a nice flowchart as well: http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows#41

The process describes a BAM of reads that are either not aligned or aligned by some process you don't want to use, so the first step seems to be Picard's RevertSam and then a realignment with BWA. I'm wondering why the process described by this GATK document splits it into per-lane BAM files. There doesn't seem to be any processing done at the per-lane level other than BWA alignment. I have two guesses: the first is that it allows more parallelization at that step.

But my second guess is that perhaps BWA doesn't play nice with read groups when reading reads from BAM input files. If that is true, it would explain why I'm having trouble with BAM (a single sample, multiple lanes, merged into one file) -> BWA -> realigned BAM -> GATK (either UnifiedGenotyper or RealignerTargetCreator, etc.): somewhere along the way, read groups are getting lost. So my guess is that the pipeline described above splits per lane so it can manually respecify read groups to BWA all over again?

Is that other people's experience as well?
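One data point on the read-group question above: some BWA versions accept an explicit read group header line at the sampe step, which is one way a per-lane split lets a pipeline respecify read groups per lane. A hedged sketch (the -r syntax should be checked against your BWA version, and all file names and @RG fields are placeholders):

```shell
# Hedged sketch: pass the read group line through bwa sampe explicitly
# so it survives the BAM -> BWA -> realigned-BAM round-trip.
bwa sampe -r "@RG\tID:lane1\tSM:sample1\tLB:lib1\tPL:ILLUMINA" \
    reference.fasta lane1_1.sai lane1_2.sai \
    lane1_1.fq lane1_2.fq > lane1.rg.sam
```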

Also, the Methods and Workflows page describes a Queue script, but there's no link or anything to the actual Queue script. Anyone know where to find it?

Danny

Looking at the DataProcessingPipeline script I noticed that "outputDir" is only applied to the cohort list file, but not to the actual processed files, which seems a bit inconsistent with the docs, which say "Output path for the processed BAM files". Here is my fix for this:

```diff
diff --git a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
index 56f6460..6dd84b5 100755
@@ -249,8 +249,12 @@ class DataProcessingPipeline extends QScript {
     // put each sample through the pipeline
     for ((sample, bamList) <- sampleBAMFiles) {

-      // BAM files generated by the pipeline
-      val bam        = new File(qscript.projectName + "." + sample + ".bam")
+      // BAM files generated by the pipeline
+      val bam        = if (outputDir.isEmpty())
+                         new File(qscript.projectName + "." + sample + ".bam")
+                       else
+                         new File(outputDir + qscript.projectName + "." + sample + ".bam")
+
       val cleanedBam = swapExt(bam, ".bam", ".clean.bam")
       val dedupedBam = swapExt(bam, ".bam", ".clean.dedup.bam")
       val recalBam   = swapExt(bam, ".bam", ".clean.dedup.recal.bam")
@@ -292,6 +296,15 @@ class DataProcessingPipeline extends QScript {
   }

+  // Override the normal swapExt method by adding the outputDir to the file path by default if it is defined.
+  override
+  def swapExt(file: File, oldExtension: String, newExtension: String) = {
+    if (outputDir.isEmpty())
+      new File(file.getName.stripSuffix(oldExtension) + newExtension)
+    else
+      swapExt(outputDir, file, oldExtension, newExtension)
+  }
+

  /****************************************************************************
```


This, of course, puts all the files in the directory specified by outputDir, but I think that seems more reasonable than putting them in the execution directory of the script.

Can the DataProcessingPipeline (as used by Queue.jar) be used to process BAM files in the best recommended way ("Best: multi-sample realignment with known sites and recalibration")?

When I try to run the latest version of the DataProcessingPipeline I get the following error message:

##### ERROR MESSAGE: GATK Lite does not support all of the features of the full version: base insertion/deletion recalibration is not supported, please use the --disable_indel_quals argument

However, in the recal case class this option is set to true: this.disable_indel_quals = true

Any idea how to solve this? It seems to me this is a bug, but I cannot find the source for the PrintReads class (guessing that it's not in the public code), so I can't check it myself.

Regards, Johan