Tagged with #qscript
2 documentation articles | 0 announcements | 7 forum discussions


Comments (17)

Introduction

Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)

To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

Comments (25)

Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided has an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what are the processing steps and what parameters you need to set/adjust.

Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).

Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

Command-line arguments

Required Parameters

Argument (short-name) Argument (long-name) Description
-i <BAM file / BAM list> --input <BAM file / BAM list> input BAM file - or list of BAM files.
-R <fasta> --reference <fasta> Reference fasta file.
-D <vcf> --dbsnp <dbsnp vcf> dbsnp ROD to use (must be in VCF format).

Optional Parameters

Argument (short-name) Argument (long-name) Description
-indels <vcf> --extra_indels <vcf> VCF files to use as reference indels for Indel Realignment.
-bwa <path> --path_to_bwa <path> The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option)
-outputDir <path> --output_directory <path> Output path for the processed BAM files.
-L <GATK interval string> --gatk_interval_string <GATK interval string> the -L interval string to be used by GATK - output bams at interval only
-intervals <GATK interval file> --gatk_interval_file <GATK interval file> an intervals file to be used by GATK - output bams at intervals

Modes of Operation (also optional parameters)

Argument (short-name) Argument (long-name) Description
-p <name> --project <name> the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam
-knowns --knowns_only Perform cleaning on knowns only.
-sw --use_smith_waterman Perform cleaning using Smith Waterman
-bwase --use_bwa_single_ended Decompose input BAM file and fully realign it using BWA and assume Single Ended reads
-bwape --use_bwa_pair_ended Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads

The Pipeline

Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file):

the data processing pipeline

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.

Sample Level Processing

This is the where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

Indel Realignment

This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional smith waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.

Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate

The Outputs

The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, a analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

Validation Files

We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation file is named : <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.

Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.

Examples

  1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

    java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/GATK/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -p myFancyProjectName \ -i myDataSet.list \ -R reference.fasta \ -D dbSNP.vcf \ -run

  2. Performing realignment and the full data processing pipeline in one pair-ended bam file

    java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -bwa path/to/bwa \ -i test.bam \ -R reference.fasta \ -D dbSNP.vcf \ -p myProjectWithRealignment \ -bwape \ -run

No posts found with the requested search criteria.
Comments (7)

Hi,

Thanks to previous replies can run Queue and the relevant walker on a distributed computing server. The question was if I define my scala script to require an argument for the output file, using the -o parameter like so:

            // Required arguments.  All initialized to empty values.
            ....

             @Input(doc="Output file.", shortName="o")
                              var outputFile: File = _

How do I direct the output to pipe the result to a specified directory? Currently I have the code: genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf")

Currently when I include the string -o /path/to/my/output/files/MyResearch.vcf

The script creates a series of folders within the directory where I execute Queue from. In this case my results were sent to: /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf

when all I wanted was the output to appear in the path: /path/to/my/output/files/MyResearch.unfiltered.vcf

As always any help is much appreciated.

Comments (12)

I am working on a Queue script that uses the selectVariants walker. Two of the arguments that I am trying to use both use an enumerated type: restrictAllelesTo and selectTypeToInclude. I have tried passing these as strings however I get java type mismatch errors. What is the simplest way to pass these parameters to the selectVariant walker in the qscript?

Comments (2)

Hi,

I am trying to understand how GATK Queue works, and wrote a mini script to do alignment using bwa mem. The test code failed when running.

The error messages: not found: type not found: type ExternalCommonArgs ERROR 21:25:34,464 QScriptManager - case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs ERROR 21:25:34,464 QScriptManager - case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs

Then I removed ExternalCommonArgs and ran again. The error is: ERROR MESSAGE: Non-file found. Try removing the annotation, change the annotation to @Argument, or extend File with FileExtension: private int temp.bwaThreads: 1

Any suggestions for debugging? My code is attached below.

Thanks


import org.broadinstitute.sting.queue.QScript

class temp extends QScript {

qscript =>

@Input(doc="input fastq file - or list of fastq files", shortName="I", required=true) var input: File = _

@Input(doc="Reference fasta file", shortName="R", required=true) var reference: File = _

@Input(doc="Number of threads BWA should use", fullName="bwa_threads", shortName="bt", required=false) var bwaThreads: Int = 1

val queueLogDir: String = ".qlog/" // Gracefully hide Queue's output

def performAlignment(fq: File): File = { var alignedBam: File = swapExt(fq, ".fq", ".sam") add(bwa_mem(fq, alignedBam)) alignedBam

}

def script() { var outBam: File = performAlignment(input)
}

case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs{ @Input(doc="fastq file to be aligned") var fq = inFastQ @Output(doc="output bam file") var bam = outBam def commandLine = "bwa mem -t " + bwaThreads + " " + reference + " " + fq + " > " + bam this.analysisName = queueLogDir + outBam + ".bwamem" this.jobName = queueLogDir + outBam + ".bwamem" } }

Comments (9)

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc="Input BAM file") val bam: File = inBam
    @Output(doc="Metrics file") val metrics: File = outMetrics
    this.input :+= bam
    this.targets = targetIntervals
    this.baits = baitIntervals
    this.output = metrics
    this.reference =  refGenome
    this.isIntermediate = false
}

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'  
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar 
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, is seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc = "Input bam file") var inbam = inBam
    @Output(doc = "Output BAM file with dups removed") var outbam = outBam
    this.REMOVE_DUPLICATES = true
    this.input :+= inBam
    this.output = outBam
    this.metrics = metricsFile
    this.memoryLimit = 3
    this.isIntermediate = false
}

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.

Comments (2)

Is the a way to access argument tags in the arguments to a Qscript?

I have a script that takes a number of bam files as input and I would like to be able to tag them. i.e.

--input:whole-genome some_long_name.bam --input:exome a_different_bam.bam

In a walker I do this and then look up the tags by calling getToolkit().getTags(argumentValue), but this isn't available to a qscript. Is there a good way to do this?

Comments (8)

I'm working on a set of related Queue scripts. I would like to have functionality shared between them, ideally in separate scala files which would be imported. Is there a way to specify additional paths for the Queue scala compiler to search or do I have to bake my library into the gatk when I build it?

Comments (2)

I'm building a variant calling qscript (it's available here), heavily based on the the MethodsDevelopmentCallingPipeline.scala. I cannot however run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1 - in which case I get a NullPointerException (I include the full error message below).

I use the following command line:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch

As you can see, I'm using the files from the gatk bundle, and I guess these should be alright for this purpose? Just to be clear I use the "-res" parameter to point to the directory where all the resource files are located, dbsnp, hapmap, etc. and the -sg parameter is what controls the scatter/gather count.

I've tried to search in the code for what might be causing this, and I can conclude that the org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed: stop = getContigInfo(contig).getSequenceLength();

This, I guess, is the obvious stuff, but this far I haven't been able to figure this out any further that this. I'm not sure if this is caused by a bug in my script, or by a bug in the GATK. Right now I'm thinking the latter of the two, since I have used the scatter/gather function in other scripts without any trouble.

Any ideas of where to continue from here, or confirmation that this is indeed something related to the GATK code would be much appreciated.

Cheers, Johan

ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals) 
java.lang.NullPointerException
        at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
        at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
        at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
        at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
        at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)