FullCallingPipeline.q

From GSA
Jump to: navigation, search

Warning: the material on this page is considered obsolete and is only maintained for historical purposes.


FullCallingPipeline.q is a script that implements the Broad's current variant-calling practices for small-target, whole-exome, and whole-genome human NGS datasets with deep coverage (>20x). This script performs all of the steps required to go from analysis-ready BAM files (BAMs that have already been aligned, de-duped, recalibrated, and indel-cleaned) to analysis-ready VCF files (called, filtered, annotated, evaluated). It has been used for production-quality variant detection at the Broad Institute on ~7,000 human (non-cancer) samples to date.

Contents

Introduction

Variant detection is a multi-stage process, requiring the application of a variant caller, a set of filters to mark putative variants as true- or false-positives, an annotator to identify functionally interesting variation, and a quality evaluator to assess bulk properties of the callset. These tools can be cumbersome to apply, some requiring parallelization into small pieces and accumulation of the pieces into a whole file in order for the processing time to be reasonable. Many tools can be run simultaneously, greatly decreasing the running time of the pipeline but creating the additional challenge of keeping track of jobs that have been dispatched, others that are waiting to run, and jobs that can only run after another program has completed. Sometimes the tools succumb to transient errors (NFS filesystem disappears for a moment), requiring the pipeline to be restarted. Some programs require an exact specification of parameters in order to achieve best results, a small deviation from which can significantly and adversely affect the quality of the output. And because the GATK is updated so frequently, it can be very difficult to know what the appropriate settings are on a day-to-day basis.

FullCallingPipeline.q is meant to address all of these issues. It is written as a QScript (a simple text file written in the Scala programming language) and run via the custom job dispatcher, GATK-Queue. This language makes it easy to specify complicated pipeline operations, like scatter-gather (parallelize a tool over genomic intervals and merge the results back together). It also provides a simple mechanism for dependencies between jobs to be specified (e.g. "Program_B is only run after Program_A has finished successfully"). It provides an auto-retry functionality and the ability to resume from the point of failure so that transient errors don't require the whole pipeline to be rerun. The QScript contains our current production settings for all of the relevant tools (being used and evaluated every day for the sequencing projects currently being performed at the Broad). The script is updated often to reflect modifications to the GATK programs called within. And because it is a simple script checked into the GATK source code repository, the updated practices are immediately available to the sequencing community.

Running the pipeline

Prerequisites

Required resource files

The FullCallingPipeline requires the following resources (unless otherwise indicated, these are all available in our GATK resource bundle):

  • Reference file: the full path to the human genome fasta reference
  • dbSNP rod for annotation: the full path to the dbSNP file to be used to annotate rs-ids of variants
  • dbSNP rod for evaluation: the full path to the dbSNP file to be used to consider variants as known or novel during metric evaluation (this can be the same as the dbSNP file used for annotation, but often it's useful to have separate files. For instance, annotation with dbSNP 132, but evaluation with dbSNP 129 simply because the metrics when evaluating against dbSNP 129 can be more readily compared to results currently being published).
  • refGene table for annotation: the full path to the refGene table for functional annotation of variants in coding regions. See GenomicAnnotator for instructions on obtaining this file.
  • interval list: the full path to an interval list defining the regions of the genome to process. This can be the whole-genome (one or more chunks per chromosome), or an explicit list of targets from a capture array, or intervals for genes of interest. But it's critical that *some* intervals be specified because the information is used to figure out how to parallelize the jobs.

Project YAML file

The information above needs to be specified in a project YAML file - a file in YAML format that specifies locations of resource files (reference file, dbSNP files, annotation table, etc.), the full paths to the BAM files to be used in calling, and the sample names. The example below shows all of the fields which are required for a complete project YAML.

{
 project: {
   name: YourProjectNameHere,
   referenceFile: /Path/to/your/reference.fasta,
   genotypeDbsnp: /Path/to/your/dbsnp/for/annotation.rod,
   evalDbsnp: /Path/to/your/dbsnp/for/what/should/be/considered/known/variants.vcf,
   refseqTable: /Path/to/your/refseqtable.txt,
   intervalList: /path/to/your/targets.interval_list
  },
 samples: [
   {
     id: MyFirstSamplesUniqueID,
     bamFiles: { cleaned: /path/to/a/bam/that/is/cleaned.bam },
   },
   {
     id: MySecondSamplesUniqueID,
     bamFiles: { cleaned: /path/to/a/different/bam/that/is/cleaned.bam },
   }
 ]
}

A template for yamls can be found here: File:Example.yaml.

Command-line for running the pipeline

The simplest command for running the pipeline is as follows:

java -Djava.io.tmpdir=/path/to/my/temp/directory -jar /path/to/Sting/dist/Queue.jar \
     -S /path/to/Sting/scala/qscript/playground/FullCallingPipeline.q \
     -G /path/to/Sting/dist/GenomeAnalysisTK.jar \
     -Y /path/to/my.yaml \
     -bsub \
     -jobQueue my_LSF_queue \
     -run

QScripts also have a "dry-run" capability in which the commands that will be run are displayed on the screen but not actually executed. This is useful for ensuring that the YAML file is correctly formatted, that the resource files are actually present, etc. In order to see what will happen without running any commands, simply omit the "-run" argument.

Basic command-line arguments

Argument (short-name) Argument (long-name) Description
-Y <yaml file> --yamlfile <yaml file> The project YAML file containing paths to resource files, BAM files, and sample names as described above.
-G <gatk jar> --gatkjar <gatk jar> The full path to GenomeAnalysisTK.jar (necessary to run all of the GATK tools)
-varScatter <number of jobs> --num_var_scatter_jobs <num_var_scatter_jobs> Level of parallelism to employ for UnifiedGenotyper in calling both SNPs and indels. By default, this is set to 20.
-expand <bases to expand by> --expandintervals <bases to expand by> For every target in the interval list, expand it by the specified number of bases on both sides of the target. This is useful when the input interval list is a whole-exome target design, but you want to call the flanking regions as well because there is often sufficient near-target data there. By default, this is set to 50bp. Set this to 0 in order to disable target expansion.
-bsub --bsub A high-level GATK-Queue argument that specifies that jobs should be dispatched to LSF. Note: if you do not specify this argument, jobs will run serially on the local machine, and this pipeline is much too slow to be reasonably be run in serial mode on a reasonably large dataset (50-100 samples, 32-megabase exome).
-jobQueue <job queue name> --job_queue <job queue name> A high-level GATK-Queue argument that specifies the LSF queue to which jobs should be dispatched.

While the most critical arguments are listed above, many more arguments exist. Type

java -Djava.io.tmpdir=/path/to/my/temp/directory -jar /path/to/Sting/dist/Queue.jar \
     -S /path/to/Sting/scala/qscript/playground/FullCallingPipeline.q \
     -h

for a complete listing.

What the FullCallingPipeline.q script does

Steps of the pipeline

A diagram of the steps run by the FullCallingPipeline.

FullCallingPipeline has essentially three stages - generation of indel calls, generation of SNP calls, and finalization (merging, annotating, evaluating) of the callset. These steps are detailed below (and summarized in the diagram on the right):

1. Indels

default (built-in) settings for UnifiedGenotyper
per-sample downsampling to 600x
rs-ids annotated with dbSNP
run many ways parallel on different computers in order to reduce the overall running time
uses QUAL < 30.0 filter
uses SB > -1.0 filter
uses QD < 2 filter

2. SNPs

  • Generation of unfiltered SNP calls using UnifiedGenotyper and its default settings
default (built-in) settings for UnifiedGenotyper
per-sample downsampling to 600x
rs-ids annotated with dbSNP
run many ways parallel on different computers in order to reduce the overall running time.
uses SB >= 0.10 filter
uses QD < 5.0 filter
uses HRun >= 4 filter
uses clustered SNPs (>= 3 SNPs within 10 bp) filter
masks out SNPs that overlap the filtered indels

3. Combined SNPs and indels

  • Combining results into a single VCF containing both SNPs and indels using CombineVariants
unions the indel and SNP VCFs to a single file
uses the refGene annotation table to mark the functional consequence of each coding variant across *all* available transcripts
computes the number of variants, Ti/Tv ratios, and metrics per allele count. If the -expand argument was specified to the pipeline, the metrics are computed separately on the original targets (usually the exons) and the expanded targets (usually the intronic flanking regions).

Implementation notes

Assumptions

The following assumptions are made by this pipeline script. If these assumptions differ greatly from your needs, you may want to consider using this script as a guide for your work, rather than simply invoking it on your data.

  1. The script is designed to dispatch parallel-running jobs via LSF. If you do not have access to an LSF cluster, then GATK-Queue will not be able to execute the jobs in parallel. Depending on your dataset, that could mean that you would have to wait a very, *very* long time for results.
  2. The script is optimized for processing approximately 100 small-target, whole-genome, or whole-exome samples at a time. Attempting to call significantly more samples than this may require more memory than the script is currently hard-coded to request for each job.
  3. BAM files are assumed to have been aligned, de-duped, recalibrated, and indel cleaned prior to the invocation of this script. A legacy option exists to indel-clean the BAM files before the variant calls are made - simply replace the word "cleaned:" in the YAML file with the word "recalibrated:". You will also need to specify the -P (or --picardfixmatesjar) argument with the full path to the Picard FixMateInformation tool.
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox