Tagged with #retired
2 documentation articles | 0 events or announcements | 0 forum discussions


This article is out of date and has been replaced by updated documents:

- Primer on parallelism

- Specific usage recommendations

The old document, for archival purposes:

1. Overview

The MapReduce architecture of the GATK allows most walkers in the GATK to be run in a parallel-processing mode. The GATK supports two basic parallel processing models known as shared memory and scatter-gather.

  • Shared memory parallelism

    Parallelism within a single multi-threading process with access to a large, shared RAM. Shared memory parallelism is stable and supported by many tools that access pileups of bases.

  • Scatter/gather (SG) parallelism

    In SG parallelism, the target genomic regions are divided up into N independent GATK instances that are run separately on a single machine or across a computing cluster. The outputs of each independent walker, are merged together once all are completed. SG works very efficiently in the GATK, provided the output of a walker is independent per site (e.g. Unified Genotyper) or per chromosome (e.g. Indel Realigner). SG parallelism is a completely stable approach in the GATK, and used routinely by the GATK team in processing large data sets; it is also natively supported by GATK-Queue, which automatically scatters and gathers GATK processes given a desired N number of processes to execute simultaneously.

Note that parallel-processing will significantly speed up data processing but may produce statistically insignificant differences. While this non-determinism is not ideal in practice the minute differences have been mathematically meaningless while producing consistent results in a reasonable amount of time for whole genome and whole exome data. However, if absolute determinism is more important than speed we recommend you do not use parallelism with the GATK.

2. Comparison of GATK parallelism options

There are costs and benefits to each type of parallelism in the GATK, as outlined in the following table.

Comparison of standard parallelism approaches in the GATK

Property Shared Memory Scatter/Gather
Stability Stable Stable | Retired in codebase
Applicable walker types By locus and by ROD only. ReadWalkers are not supported. All walker types. ReadWalkers can only be split safely by chromosome in general
Example walkers UnifiedGenotyper, BaseRecalibrator, VariantEval All walkers, including ReadWalkers like IndelRealigner
Scalability Fewer than 32 cores. Each thread operates completely independently, so N threads requires N times more memory than 1 thread alone. Best scaling at 8 or fewer threads. Hundreds of processes. Limited by capabilities of the underlying storage system, in general. Isilon-class storage can run thousands of jobs effectively.
How to enable Use the -nt argument in the Engine, on any walker that supports shared memory parallelism (the engine will tell you if it does not). 1. Provide -L interval lists to the GATK; a different one for each of the N independent GATK tools. For example -L chr1 for first process, and -L chr2 for the second. When all processes have finished, merge the output together, as appropriate (e.g. use MergeSam.jar for BAMs, and cat or CombineVariants for VCFs). 2. Use GATK-Queue to automatically divide up your GATK jobs. For example, using this.scatterCount = 10 argument will result in 10 independent processes.
Pros - Easy to enable. - Single output file merged together by internally by the GATK engine - Efficiently uses multi-core processors sharing a single memory space - Works for all walker types, including ReadWalkers - Scales to hundreds of independent jobs - Easy to enable with single -L argument - Directly supported and managed by GATK-Queue - Totally independent processing per interval -- failed parts can be easily resumed without repeating already successfully processed regions
Cons - Limited to fewer than 32 processors without significant overhead - Limited to cores physically present on the machine, cannot take advantage of computing cluster resources - Does not work for ReadWalkers (Table Recalibrator, Indel Realigner) - Requires manual preparation of sliced genomic intervals for processing (if you aren't using GATK-Queue). - For ReadWalkers and other tools that can only be processed by chromosome, leading to load balancing problems (chr1 is much bigger than chr22) - Sensitive to data density variation over the genome. Dividing chr20 processing in 63 1MB chunks leads to 10x differences in completion times due to data pileups around the centromere, for example. - Must wait until all parts of the scatter have completed before gathering, making the process sensitive to farm scheduling problems. If a job will run for M minutes, but waits Z minutes to start on the farm, the entire SG process will not complete for at least M + Z minutes.

3. Which parallelism option is right for me?

Almost certainly, either shared memory or scatter/gather parallelism is the right choice for your NGS pipeline. Our go-to option for parallelism here at the Broad Institute is S/G, since S/G allows you to cut up your jobs into hundreds of pieces to run on our standard computing farm, using GATK-Queue. When I have a small job that needs to be run quickly, am testing out program options or need a quick VariantEval result, I'm using shared memory parallelism with ~10 threads on a single large computer with a lot (>64 GB) of memory.

Basically, if I have N processors, and I want to choose between shared memory or S/G, here's how I would choose:

  • If all N processors are on a single computer, and my walker supports it, use shared memory parallelism

  • If not, use S/G

4. Shared Memory Parallelism (Stable)

The GATK currently supports a hierarchical version of parallelism. In this form of parallelism, data is divided into shards, each shard is map/reduced independently, and the results are combined with a 'tree reduce' step. While the framework handles much of the heavy lifting of data division required for parallelism, each walker must individually be reviewed to ensure that it isn't tracking state internally in a non-threadsafe way. Many tools support shared memory parallelism, including critical tools such as:

  • UnifiedGenotyper
  • CountCovariates
  • VariantEval

Please review the source to discover if your walker is parallelizable, or attempt to run it with -nt 2 and if it the engine doesn't complain the walker is parallelized.

In shared memory parallelism, each thread operates on a 16 kbp shard of reference sequence in a completely independent memory space from all other threads. Up to 24 threads can run efficiently in this design, but greater parallelism is limited by resource starvation from the single reduce thread and/or memory inefficiency by keeping each thread’s data totally independent. See gatkParallelism performance 082112 these plots for an analysis of the scalability of several key GATK tools as a function of nt.

Enabling n-way parallelism from the command-line

Run the GATK, using the -nt command-line argument to specify the number of threads that the GATK should attempt to use.

[[image:HierarchicalParallelism.png|thumb|Shared memory parallelism architecture]]

Helpful hints: Implementing a walker with parallelism in mind

First, be aware that some walkers may, by design, require a rewrite for complete parallelization.

  • When implementing a standard (non-parallelized) walker, one must implement the reduce method, which combines an individual datum returned by the map function with the aggregate of the prior map calls. When implementing a parallelizable walker, one must also implement the org.broadinstitute.sting.gatk.walkers.TreeReducible interface and the treeReduce() function. The TreeReduce function tells the GATK how to combine two adjacent reduces, rather than one map result and one reduce.

  • The GATK supports writing to output files from either the map or the reduce when running in parallel. However, only unbuffered writers are currently supported. Please use PrintStream rather than PrintWriter at this time.

Limitations

The GATK's support for parallelism is currently limited. The following classes of walkers are not supported by our parallelization framework:

  • Read walkers
  • Reduce-by-interval walkers

Note that each thread operates completely independently in the current GATK implementation of shared memory parallelism. So if you provide 1Gb to the GATK with nt 1, then you should provide 4Gb to run with nt 4. If you don't do this, it is possible to starve out the GATK so that it runs much much slower.

The performance of the multi-threaded GATK is really dependent on whether you are IO or CPU limited and the relative overhead of the parallelism on your computer. Additionally, nt can start to have very high overheads with nt > 20 due to our implementation and memory contention issues.

The best option for nt is a value less or equal to the number of available cores with sufficient memory to run each threads (nt x amount provided to 1 core), capped additionally by the available IO bandwidth so that the individual threads don't starve each other.

5. Scatter / gather parallelism

Scatter / gather is a simple approach for process-independent parallelism with the GATK. First you scatter multiple independent GATK instances out over a network of computing nodes, each operating on slightly different genomic intervals, and when they all complete, the output of each is gathered up into a merged output dataset. In the GATK S/G is extremely easy to use, as all of the GATK tools support the -L option to operate over only genomic specific intervals, and many tools emit files that can be merged together across multiple regions. Unified Genotyper, for example, can operate over the whole genome in one process, or on each chromosome independently. The output of this later approach, after merging together, should be the same as the whole genome results, minus any slight differences due to random number effects. The simplicity and efficiency of S/G parallelism makes this a key option for getting things done quickly with the GATK.

Using S/G parallelism is extremely easy, either by hand or using the automated Scatter/Gather in GATK-Queue. Suppose I have the following command line:

java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1

This runs a single process of the GATK over chr1, and let's say it takes an hour when I run it. In order to run it with S/G parallelism mode, just partition chr1 into two independent parts:

This file distributed.tracker.txt will contain genomic locations and GATK process ids that are processing each location, in text format, so you can cat it. If you run at the command line:

gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:1-125,000,000 -o my.1.vcf &
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:125,000,001-249,250,621 -o my.2.vcf &

When these two jobs finish, I just merge the two VCFs together and I've got a complete data set in half the time.

Data Processing Pipeline Script

The Data Processing Pipeline is a Queue script that performs all raw data processing described in this page following our most current stable recommendations for best practices.

Introduction

Our current best practice for making SNP and indel calls is divided into 4 sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes. The exact commands for each tool are available on the individual tool's wiki entry. See here for a visual representation of the workflow.

Note that, due to the specific attributes of a project, that the the specific values used in each of the commands may need to be selected by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits his/her data.

Lanes, Samples, Cohort

There are four major organizational units for next-generation DNA sequencing processes:

Lane 
The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.
Library 
A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.
Sample 
A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Here we treat samples as independent individuals whose genome sequence we are attempting to determine. From this perspective, tumor / normal samples are different despite coming from the same individual.
Cohort 
A collection of samples being analyzed together. This organizational unit is the most subjective and depends intimately on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many samples (e.g., ESP with 800 EOMI samples) deeply sequenced we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.

Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.

Testing data: 64x HiSeq on chr20 for NA12878

In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.

Phase I: Raw data processing

Initial read mapping

The GATK data processing pipeline assumes that one of the many NGS read aligners (see [1] for a review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits BAM files natively.


Raw BAM to realigned, recalibrated BAM

The three key tools here are Base quality score recalibration, Local realignment around indels, and MarkDuplicates. Although ideally one follows the recommended workflow, in practice MarkDuplicates can be run before local realignment, in order to handle cases where duplicates overlapping indels get marginally different alignments (unlikely but possible) and so will not be considered as potential duplicates because MarkDuplicates looks only at read pair start and stop positions. There are several options here, from the easy and fast basic protocol to the more.

There are two types of realignment:

  • Realignment only at known sites, which is very efficient, can operate with little coverage (1x per lane genome wide) but can only realign reads at known indels.
  • Fully local realignment uses mismatching bases to determine if a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment. It is much slower (involves SW step) but can discover new indel sites in the reads. If you have a database of known indels (for human, this database is extensive) then at this stage you would also include these indels during realignment, which vastly improves sensitivity, specificity, and speed.

Previous recommendation: lane-level recalibration, sample-level realignment

This is the protocol we used at the Broad Institute for the last year.

for each lane.bam
    dedup.bam <- MarkDuplicate(lane.bam)
    recal.bam <- recal(dedup.bam)

for each sample
    lanes.bam <- merged recal.bam's for sample
    dedup.bam <- MarkDuplicates(lanes.bam)
    realigned.bam <- realign(dedup.bam) [with known sites if possible]

Fast: lane-level realignment at known sites only and lane-level recalibration

This protocol adds lane-level local realignment around known indels, making it very fast (no sample level processing) and gives better results for human samples than the previous recommendation:

for each lane.bam
    realigned.bam <- realign(lane.bam) [at only known sites]
    dedup.bam <- MarkDuplicate(realigned.bam)
    recal.bam <- recal(dedup.bam)

for each sample
    recals.bam <- merged lane-level recal.bam's for sample
    dedup.bam <- MarkDuplicates(recals.bam)

Fast + sample-level realignment

This protocol adds sample-level realignment after library / sample level dedupping, so that novel indels in each sample can be discovered and realigned around.

for each lane.bam
    realigned.bam <- realign(lane.bam) [at only known sites]
    dedup.bam <- MarkDuplicate(realigned.bam)
    recal.bam <- recal(dedup.bam)

for each sample
    recals.bam <- merged lane-level recal.bam's for sample
    dedup.bam <- MarkDuplicates(recals.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]

Better: sample-level realignment with known indels and recalibration

Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.

for each sample
    lanes.bam <- merged lane.bam's for sample
    dedup.bam <- MarkDuplicates(lanes.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]
    recal.bam <- recal(realigned.bam)

Best: multi-sample realignment with known sites and recalibration

Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples. Although not generally necessary for deep sequencing data, this is important for low-coverage multi-sample SNP calling projects like the 1000 Genomes Project. Note that the computational cost here is so extreme that we only do this analysis in special circumstances, such as large-scale data freeze for the project.

Note that for contrastive calling projects -- such as cancer tumor/normals -- that we recommend cleaning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

for each sample
    lanes.bam <- merged lane.bam's for sample
    dedup.bam <- MarkDuplicates(lanes.bam)

samples.bam <- merged dedup.bam's for all samples
realigned.bam <- realign(samples.bam)
recal.bam <- recal(realigned.bam)

Misc. notes on the process

  • MarkDuplicates needs only be run at the library level. So the sample-level dedupping isn't necessary if you only ever a library on a single lane. If you run the sample library on many lanes (as can be necessary for whole exome, for example), you should dedup at the library level.
  • The base quality score recalibrator is read group aware, so running it on a merged BAM files containing multiple read groups is the same as running it on each bam file individually. There's some memory cost (so it's best not to recalibrate 10000 RGs simultaneously) but for reasonable projects this is a fine.
  • Local realignment preserves read meta-data, so you can realign and then recalibrate just fine.

Initial variant discovery and genotyping

Input BAMs for variant discovery and genotyping

After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass. A nice size for BAMs is 10-300 Gb, just for organizing on disk.

For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.

Multi-sample SNP and indel calling

The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Unified genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.

Note that by default the Unified Genotyper calls SNPs only. To enable the indel calling capabilities (in addition to SNPs) instead use the -glm BOTH argument.

Selecting an appropriate quality score threshold

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data 
we recommend a minimum confidence score threshold of Q30.
Shallow (< 10x coverage per sample) data 
because variants have by necessity lower quality with shallower coverage, we recommend a min. confidence score of Q4 in project with 100 samples or fewer and Q10 otherwise.

Protocol

raw.vcf <- unifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)

Integrating analyses: getting the best call set possible

This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the FP machine artifacts from the TP genetic variants.

Whole Genome Shotgun experiments

The tool used here is the Variant quality score recalibration which builds an adaptive error model using known variant sites and then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact. The tool builds a separate model for SNPs and indels and must be run twice in succession in order to recalibrate both classes of variants. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Analysis ready VCF protocol
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)

Common VariantRecalibrator command

Please review the Variant quality score recalibration wiki page for details of how one specifies truth and training sets.

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input,VCF raw.vcf \
   [SPECIFY TRUTH AND TRAINING SETS]
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -rscriptFile path/to/output.plots.R
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING]
SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle.

Arguments for VariantRecalibrator command:

   -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \
   -mode SNP \

Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, MQ, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, note that some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated.

Using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

Arguments for VariantReacalibrator:

   -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. The annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner.

Common ApplyRecalibration command

The Variant quality score recalibration wiki page has an example command for the ApplyRecalibration step which works for both SNPs and indels. One would just need to add -mode SNP (the default) or -mode INDEL to the command line.

Whole Exome experiments

For exome SNPs, the tool used here, as in the whole genome case, is the Variant quality score recalibration which builds an adaptive error model using known variant sites and then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact. In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller callset but experiment with the precise argument settings (try adding --maxGaussians 4 --percentBad 0.05 to your command line, for example)
  • Use hard filters (detailed below).

For exome indels there generally isn't enough data to build a robust error model and so we recommend applying hand filters to the exome indels. The protocol is to first use SelectVariants to separate out SNPs and indels. Then recalibrate the SNPs and hand filter the indels in parallel. Finally, use CombineVariants to combine the high quality variants back into one VCF file.

Analysis ready VCF protocol
raw.snps.vcf <- Select(raw.vcf, SNP)
raw.indels.vcf <- Select(raw.vcf, INDEL)
snp.model <- BuildErrorModelWithVQSR(raw.snps.vcf)
recalibratedSNPs.vcf <- ApplyRecalibration(raw.snps.vcf, snp.model)
filteredIndels.vcf <- Filter(raw.indels.vcf)
analysisReady.vcf <- CombineTogether(recalibratedSNPs.vcf, filteredIndels.vcf)
SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle.

Arguments for VariantRecalibrator command:

   --maxGaussians 6 \
   -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
   -mode SNP \

Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, MQ, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, note that some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. Additionally, notice that DP was removed when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured. In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Common ApplyRecalibration command

The Variant quality score recalibration wiki page has an example command for the ApplyRecalibration step.

Indel specific recommendations

Arguments for VariantFiltrationWalker:

  --filterExpression "QD < 2.0" \
  --filterExpression "ReadPosRankSum < -20.0" \
  --filterExpression "InbreedingCoeff < -0.8" \
  --filterExpression "FS > 200.0" \
  --filterName QDFilter \
  --filterName ReadPosFilter \
  --filterName InbreedingFilter \
  --filterName FSFilter \

Note the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

Making analysis ready SNP and indel calls with hand filtering when VQSR is not possible

Variant quality score recalibration requires a reasonable amount of data to work properly so with targeted resequencing of a small region (for example, a few hundred genes) VQSR may be inappropriate leaving hand filtering as the only option.

Arguments for VariantFiltrationWalker:

  --filterExpression filter1
  --filterName filterName1
  --filterExpression filter2
  --filterName filterName2
  .
  .
  .

where DATA_TYPE_SPECIFIC_FILTERS has project-specific filtering strings selected below:

  • For SNPs
    • DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0".
  • For Indels
    • DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0".

Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that For exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.

That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by Variant quality score recalibration.

Expected SNP call quality

All of the following tables were generated using VariantEval. You can generate your own data points for comparing with that tool and the correct input / comparison VCF files.

Summary results for deep whole genome, multi-sample low-pass, and whole exome

The associated table (see attachment) provides some expectations for running deep whole genomes, single whole exomes, and multi-sample low-pass (from the 1000 genomes) in terms of sensitivity, specificity, and Ti/Tv ratios for known and novel calls. Obviously individual data sets will be different but this demonstrates that running the pipeline described here ( realigned -> recalibration -> calling -> filtering -> variant quality score recalibration) can produce excellent results for a variety of data types. Note that the deep data sets are single sample; in our hands multi-sample deep data results in much better calls overall.

Expected Ti/Tv ratios

We provide two useful data points (see attachment) for evaluating the quality of SNP calls whole genome or in the targeted whole exome (Agilent). You can follow a similar methodology to establish the Ti/Tv expectations for your region of interest, if captured separately, by running VariantEval on the 1000 Genomes Trios or Complete Genomes or other highly reliable data sets given the targeted interval.


Sorry, there are no publicly available documents of this type with the tag #retired. Try one of the other types.
Sorry, there are no publicly available documents of this type with the tag #retired. Try one of the other types.