Tagged with #dataprocessingpipeline
1 documentation article | 0 announcements | 13 forum discussions

Created 2012-07-23 17:05:10 | Updated 2013-03-25 22:18:53 | Tags: dataprocessingpipeline queue workflow pacbio qscript intermediate
Comments (17)


Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)

To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

No posts found with the requested search criteria.

Created 2015-08-09 11:38:20 | Updated | Tags: best-practices dataprocessingpipeline cancer somatic-variants
Comments (2)

Hello, We're currently running a research project involving exome sequencing of tumors and paired normal blood samples from brain cancer patients. I'm relatively new to NGS processing but thanks to this forum (and the excellent documentation provided by the Broad) I was able to put together the following pipeline for this project: I just wanted to check with the Cancer Team that everything looks OK, as well as get any suggestions, if possible. Looking forward to your valuable response, -E

Created 2013-03-15 00:06:36 | Updated | Tags: dataprocessingpipeline queue
Comments (1)

My lab is trying to use Queue, but out pipeline is spawning very large number of jobs (as would be expected). Our lab has very spiky data processing requiernments and it would be useful to be able to limit the number of jobs queue submits at a time so other, non-queue jobs can process in parallel. Is this possible?


Created 2013-03-04 17:18:24 | Updated | Tags: dataprocessingpipeline
Comments (1)

basically, I am using DataProcessingPiepline, and interestingly, given an unaligned bam file which is 6gig, the DataProcessingPipeline sometimes does the all processes of the pipeline but produces a very small output (200K) or most of the times, doesn't produce anything. The log file is as following,

INFO 11:57:55,892 QScriptManager - Compiling 1 QScript INFO 11:58:08,656 QScriptManager - Compilation complete INFO 11:58:09,062 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,062 HelpFormatter - Queue v2.4-3-g2a7af43, Compiled 2013/02/27 12:20:10 INFO 11:58:09,062 HelpFormatter - Copyright (c) 2012 The Broad Institute INFO 11:58:09,063 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:58:09,063 HelpFormatter - Program Args: -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/xc1094WELa.49487._new.bam -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf -p c1094WELa.49487 -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ -bwape -run INFO 11:58:09,063 HelpFormatter - Date/Time: 2013/03/04 11:58:09 INFO 11:58:09,063 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,063 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,071 QCommandLine - Scripting DataProcessingPipeline INFO 11:58:09,314 QCommandLine - Added 13 functions INFO 11:58:09,314 QGraph - Generating graph. INFO 11:58:09,342 QGraph - Running jobs. INFO 11:58:09,358 QGraph - 0 Pend, 0 Run, 0 Fail, 13 Done INFO 11:58:09,409 QCommandLine - Writing final jobs report... INFO 11:58:09,410 QJobsReporter - Writing JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt INFO 11:58:09,432 QJobsReporter - Plotting JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf WARN 11:58:09,452 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf INFO 11:58:09,456 QCommandLine - Script completed successfully with 13 total jobs

and the command I am using,

java -Xmx4g -Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp/ -jar /medpop/mpg-psrl/Parabase/Tools/Queue-2.4-3-g2a7af43/Queue.jar -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487._new.bam -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf -p c1094WELa.49487 -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ -bwape -run > /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487.gatkpipline.log

What could be wrong here, since it doesn't really complain about anything ! Thanks,

Created 2013-01-27 19:59:26 | Updated | Tags: dataprocessingpipeline
Comments (1)

I am using Queue v2.3-5, and GenomeAnalysisTK-2.3-6-gebbba25 for a couple of my Bam files, I have encountered the following error, and I was wondering what could be the reason ...

ERROR 16:14:17,964 FunctionEdge - Error: hg19/ucsc.hg19.fasta /c1193JYUa.49483._new.1.2.sai /c1193JYUa.49483._new.reverted.bam /c1193JYUa.49483._new.reverted.bam > /c1193JYUa.49483._new.1.realigned.sam 
ERROR 16:14:18,264 FunctionEdge - Contents of /medpop/mpg-psrl/Parabase/Projects/Seattle/c1193JYUa.49483._new.1.realigned.sam.out:

my command is as following

java -Xmx4g -Djava.io.tmpdir=/tmp/ -jar /Tools/Queue-2.3-5-g49ed93c/Queue.jar -S /Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /c1193JYUa.49483/c1193JYUa.49483._new.bam -R /hg19/ucsc.hg19.fasta -D hg19/dbsnp_137.hg19.vcf -p /c1193JYUa.49483/ -bwape -run > /c1193JYUa.49483/c1193JYUa.49483.gatkpipline.log

I have shortened the path for the sake of clarity ... thank you

Created 2013-01-11 17:59:38 | Updated 2013-01-11 18:05:57 | Tags: dataprocessingpipeline java
Comments (6)

I am facing a "fatal error by java runtime enviormnet" after using GATK DataProcessingPipeline, my java version is

> java version "1.6.0_35"
> Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)

and I am using GATK 2.3-6-gebbba25 The pipeline spit out the following error,

> lelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp'  '-cp' '/medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'RealignerTargetCreator'  '-I' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19._new.1.realigned.rg.bam'  '-R' '/medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta'  '-o' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals'  '-known' '/medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf'  '-mismatch' '0.0'  
> INFO  11:32:30,896 FunctionEdge - Output written to /medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals.out 
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00002ae7a5a50380, pid=562, tid=47174397446800
> #
> # JRE version: 6.0_35-b10
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.10-b01 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x712380]  SR_handler(int, siginfo*, ucontext*)+0x30
> #
> # An error report file with more information is saved as:
> # /medpop/mpg-psrl/Parabase/MyRuns/hs_err_pid562.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #

the log file is lengthy; but If you like to have a look I can paste here it later on, thanks for your support,

Created 2013-01-10 20:43:28 | Updated 2013-01-10 20:49:09 | Tags: dataprocessingpipeline queue dpp performance community
Comments (3)

Is there any rule of thumb for allocating memory through "bsub" for running DataProcessingPipeline per bam file or per number of reads ?


Created 2013-01-08 18:47:46 | Updated 2013-01-08 19:34:57 | Tags: dataprocessingpipeline queue scala
Comments (16)

I am running the following command to test my scala using GATK-2.3.5

> java -Xmx4g -Djava.io.tmpdir=tmp -jar /Queue-2.3-5-g49ed93c/Queue.jar -S Queue-2.3-5-g49ed93c/ExampleCountReads.scala -R /GATK-2.3-5-g49ed93c/resources/exampleFASTA.fasta -I /GATK-2.3-5-g49ed93c/resources/exampleBAM.bam

and I am getting this error

> INFO  13:42:55,166 QScriptManager - Compiling 1 QScript 
> ERROR 13:42:55,348 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of 'G' 
> ERROR 13:42:55,356 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,356 QScriptManager -                           ^ 
> ERROR 13:42:55,357 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,358 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,358 QScriptManager -                            ^ 
> ERROR 13:42:55,358 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,359 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,360 QScriptManager -                             ^ 
> ERROR 13:42:55,360 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of '>' 
> ERROR 13:42:55,361 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,362 QScriptManager -                                               ^ 
> ERROR 13:42:55,362 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,363 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,365 QScriptManager -                                                 ^ 
> ERROR 13:42:55,366 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,367 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,367 QScriptManager -                                                  ^ 
> ERROR 13:42:55,367 QScriptManager - ExampleCountReads.scala:39: in XML literal: '>' expected instead of ' ' 
> ERROR 13:42:55,369 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,369 QScriptManager -                                                   ^ 
> ERROR 13:42:55,369 QScriptManager - ExampleCountReads.scala:67: in XML literal: in XML content, please use '}}' to express '}' 
> ERROR 13:42:55,369 QScriptManager -   } 
> ERROR 13:42:55,369 QScriptManager -   ^ 
> ERROR 13:42:55,370 QScriptManager - ExampleCountReads.scala:39:  I encountered a '}' where I didn't expect one, maybe this tag isn't closed <WalkerName> 
> ERROR 13:42:55,371 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,371 QScriptManager -                                                     ^ 
> ERROR 13:42:55,371 QScriptManager - ExampleCountReads.scala:68: '}' expected but eof found. 
> ERROR 13:42:55,371 QScriptManager - } 
> ERROR 13:42:55,372 QScriptManager -  ^ 
> ERROR 13:42:55,390 QScriptManager - 10 errors found 
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR stack trace 
> org.broadinstitute.sting.queue.QException: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
>   at org.broadinstitute.sting.queue.QScriptManager.loadScripts(QScriptManager.scala:46)
>   at org.broadinstitute.sting.queue.QCommandLine.org$broadinstitute$sting$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:94)
>   at org.broadinstitute.sting.queue.QCommandLine.getArgumentSources(QCommandLine.scala:225)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:197)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
>   at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:61)
>   at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-5-g49ed93c):
> ##### ERROR
> ##### ERROR Please visit the wiki to see if this is a known problem
> ##### ERROR If not, please post the error, with stack trace, to the GATK forum
> ##### ERROR Visit our website and forum for extensive documentation and answers to 
> ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
> ##### ERROR
> ##### ERROR MESSAGE: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> ##### ERROR ------------------------------------------------------------------------------------------
> INFO  13:42:55,487 QCommandLine - Shutting down jobs. Please wait... 
> 13.462u 1.029s 0:10.41 139.0% 0+0k 0+0io 1pf+0w

I am sure java, gatk and queue are installed properly and the files exist in the relevant directory. Thank you,

Created 2012-11-27 08:17:54 | Updated 2013-01-07 19:51:45 | Tags: dataprocessingpipeline dpp
Comments (3)

Dear all , I am using Queue and DataProcessingPipeline.scala ( from https://github.com/broadgsa/gatk/blob/master/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala ) to process my bam file . The input is a sample level bam file which has been processed by BWA aligned , Samtools sampe and Picard merge and duplicate removed . The output bam file (~200GB/sample ) is much larger than the input bam file (~80GB/sample ) . I want to know what information was added into the bam file ? Thanks a lot .My Queue version is Queue-2.1-10 . My script :

java \ -Xmx4g \ -Djava.io.tmpdir=../tmp/ \ -jar ./Queue-2.1-10/Queue.jar \ -S ./DataProcessingPipeline.scala \ -i input.bam \ -R /db/human_g1k_v37.fasta \ -D /db//dbSnp_b137.vcf \ -run

Created 2012-10-31 16:19:49 | Updated 2012-10-31 17:32:50 | Tags: baserecalibrator dataprocessingpipeline rscript
Comments (6)

Hi there, I was trying to debug an error in the RScript generated after base recalibration, while running the DataProcessingPipeline.scala (run as it is). I get the following debug output

 Error in file(filename, "r", blocking = TRUE) : 
   cannot open the connection
 Calls: source ... eval.with.vis -> eval.with.vis -> gsa.read.gatkreport -> file
 In addition: Warning messages:
 1: In file(filename, "r", blocking = TRUE) :
   cannot open file '/SAN/scratch3/sample378_TTAGGC_L004_R1_001.fastq.pre_recal.table.recal': No such file or directory
  Execution halted

no file ending with "recal.table.recal" exists, but the file "recal.table" does exist. I couldn't find any step in the scala script where a ".recal" is added to "recal.table", nor a specific trait or class referring to the RScript itself, as I understand it's part of the walker BaseRecalibrator.

is this a small bug in the name handling, or am I doing something wrong somewhere?

thanks, Francesco

Created 2012-09-28 17:09:50 | Updated 2013-01-07 20:38:01 | Tags: dataprocessingpipeline dpp community
Comments (2)

I had a question that, while it might be more appropriate for a BWA or seqanswers audience, I noticed something in the GATK's "Data Processing Pipeline" under "Methods and Workflows" that made me wonder. The pipeline is described here and there's a nice flowchart as well: http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows#41

The process describes a BAM of reads that are either not aligned or aligned by some process you don't want to use, so the first step seems to be Picard's RevertSam and then a realignment with BWA. I'm wondering why the process described by this GATK document splits it into per-lane BAM files. There doesn't seem to be any process done at the per-lane level other than BWA alignment. I have two guesses.. the first was to allow more parallelization at that step.

But my second guess is that perhaps BWA doesn't play nice with read groups when reading reads from BAM input files. If that is true, that would explain why I'm having trouble with BAM (a single sample, multiple lanes, merged into one file) -> BWA -> realigned-BAM -> GATK (either UnifiedGenotyper or RealignerTargetCreator, etc)--somewhere along the way, read groups are getting lost. So my guess is that the above-described pipeline splits it per-lane so it can manually respecify read groups all over again to BWA?

Is that other people's experience as well?

Also, the Methods and Workflows page describes a Queue script, but there's no link or anything to the actual Queue script. Anyone know where to find it?


Created 2012-08-06 09:02:32 | Updated 2012-08-06 09:02:32 | Tags: dataprocessingpipeline
Comments (2)

Looking at the DataProcessingPipeline script I noticed that the "outputDir" is only applied to the cohort list file, but not to the actually processed files, which seems a bit inconsistent with the docs, which say "Output path for the processed BAM files". Here is my fix for this:

diff --git a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
index 56f6460..6dd84b5 100755
--- a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
+++ b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
@@ -249,8 +249,12 @@ class DataProcessingPipeline extends QScript {
     // put each sample through the pipeline
     for ((sample, bamList) <- sampleBAMFiles) {

-      // BAM files generated by the pipeline
-      val bam        = new File(qscript.projectName + "." + sample + ".bam")
+      // BAM files generated by the pipeline      
+      val bam        = if(outputDir.isEmpty()) 
+                                      new File(qscript.projectName + "." + sample + ".bam")
+                                  else
+                                      new File(outputDir + qscript.projectName + "." + sample + ".bam")
       val cleanedBam = swapExt(bam, ".bam", ".clean.bam")
       val dedupedBam = swapExt(bam, ".bam", ".clean.dedup.bam")
       val recalBam   = swapExt(bam, ".bam", ".clean.dedup.recal.bam")
@@ -292,6 +296,15 @@ class DataProcessingPipeline extends QScript {
     add(writeList(cohortList, cohortFile))

+  // Override the normal swapExt metod by adding the outputDir to the file path by default if it is defined.
+  override
+  def swapExt(file: File, oldExtension: String, newExtension: String) = {
+      if(outputDir.isEmpty())
+         new File(file.getName.stripSuffix(oldExtension) + newExtension)
+      else
+          swapExt(outputDir, file, oldExtension, newExtension);
+  }      


This of course, puts all the files in the directory specified by outputDir, but I think that this seems reasonable than putting them in the execution directory of the script.

Created 2012-08-01 11:26:49 | Updated 2012-08-01 11:26:49 | Tags: dataprocessingpipeline
Comments (4)

Can the DataProcessingPipeline (as used by Queue.jar) be used to process BAM files in the best recommended way ("Best: multi-sample realignment with known sites and recalibration")

Created 2012-07-31 15:13:03 | Updated 2012-07-31 15:13:03 | Tags: dataprocessingpipeline
Comments (7)

When I try to run the latest version of the DataProcessingPipeline I get the following error message:

ERROR MESSAGE: GATK Lite does not support all of the features of the full version: base insertion/deletion recalibration is not supported, please use the --disable_indel_quals argument

However in the recal case class this option is set to be true: this.disable_indel_quals = true

Any idea how to solve this? It seams to me this is a bug, but I cannot find the source for the PrintReads class (guessing that its not in the public code), so I can't check it myself.

Regards, Johan