Tagged with #basic
33 documentation articles | 0 events or announcements | 0 forum discussions


1. Introduction

The AlignmentContext and ReadBackedPileup work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.

2. What are read backed pileups?

Read backed pileups are objects that contain all of the reads and their offsets that "pile up" at a locus on the genome. They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools like the recalibrator and SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of these approaches. Version two of the ReadBackedPileup presents a consistent and clean interface for working pileup data, as well as supporting the iterable() interface to enable the convenient for ( PileupElement p : pileup ) for-each loop support.

3. How do I get a ReadBackedPileup and/or how do I create one?

The best way is simply to grab the pileup (the underlying representation of the locus data) from your AlignmentContext object in map:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context)
    ReadBackedPileup pileup = context.getPileup();

This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.

If you are trying to create your own, the best constructor is:

public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )

requiring only a list, in order of read / offset in the pileup, of PileupElements.

From List and List

If you happen to have lists of SAMRecords and integer offsets into them you can construct a ReadBackedPileup this way:

public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )

4. What's the best way to use them?

Best way if you just need reads, bases and quals

for ( PileupElement p : pileup ) {
  System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
  // you can get the read itself too using p.getRead()
}

This is the most efficient way to get data, and should be used whenever possible.

I just want a vector of bases and quals

You can use:

public byte[] getBases()
public byte[] getSecondaryBases()
public byte[] getQuals()

To get the bases and quals as a byte[] array, which is the underlying base representation in the SAM-JDK.

All I care about are counts of bases

Use the follow function to get counts of A, C, G, T in order:

public int[] getBaseCounts()

Which returns a int[4] vector with counts according to BaseUtils.simpleBaseToBaseIndex for each base.

Can I view just the reads for a given sample, read group, or any other arbitrary filter?

The GATK can very efficiently stratify pileups by sample, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:

pileup.getSamples();
pileup.getPileupForSample(String sampleName);

In addition to the rich set of filtering primitives built into the ReadBackedPileup, you can supply your own primitives by implmenting a PileupElementFilter:

public interface PileupElementFilter {
    public boolean allow(final PileupElement pileupElement);
}

and passing it to ReadBackedPileup's generic filter function:

public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);

See the ReadBackedPileup's java documentation for a complete list of built-in filtering primitives.

Historical: StratifiedAlignmentContext

While ReadBackedPileup is the preferred mechanism for aligned reads, some walkers still use the StratifiedAlignmentContext to carve up selections of reads. If you find functions that you require in StratifiedAlignmentContext that seem to have no analog in ReadBackedPileup, please let us know and we'll port the required functions for you.

Introduction

1. The basic workflow

Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.

Note that due to the specific attributes of a project the specific values used in each of the commands may need to be selected/modified by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits the data and project design.

2. Lane, Library, Sample, Cohort

There are four major organizational units for next-generation DNA sequencing processes that used throughout this documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.
  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.
  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Here we treat samples as independent individuals whose genome sequence we are attempting to determine. From this perspective, tumor / normal samples are different despite coming from the same individual.
  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends intimately on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many samples (e.g., ESP with 800 EOMI samples) deeply sequenced we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.

Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.

3. Testing data: 64x HiSeq on chr20 for NA12878

In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.

4. Where can I find out more about the new GATK 2.0 tools you are talking about?

In our GATK 2.0 slide archive.


Phase I: Raw data processing

1. Raw FASTQs to raw reads via mapping

The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).

2. Raw reads to analysis-ready reads

The three key processes used here are:

  • Local realignment around indels: Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs. We look for the most consistent placement of the reads with respect to the indel in order to clean up these artifacts.
  • MarkDuplicates: Duplicately sequenced molecules shouldn't be counted as additional evidence for or against a putative variant. By marking these reads as duplicates the algorithms in the GATK know to ignore them.
  • Base quality score recalibration: The per-base estimate of error known as the base quality score is the foundation upon which all statistically calling algorithms are based. We've found that the estimates provided by the sequencing machines are often inaccurate, and worse, biased. Through recalibration an empirically accurate error model is assigned to the bases to create an analysis-ready bam file. Note: if you have old data that has been recalibrated with an old version of BQSR, you need to rerun your data with the new version so insertion and deletion qualities can be added to your recalibrated BAM file.

There are several options here from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment which constitute a vastly different amount of processing power required:

  • Realignment only at known sites, which is very efficient, can operate with little coverage (1x per lane genome wide) but can only realign reads at known indels.
  • Fully local realignment uses mismatching bases to determine if a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment. It is much slower (involves SW step) but can discover new indel sites in the reads. If you have a database of known indels (for human, this database is extensive) then at this stage you would also include these indels during realignment, which vastly improves sensitivity, specificity, and speed.

Fast: lane-level realignment (at known sites only) and lane-level recalibration

This protocol uses lane-level local realignment around known indels (very fast, as there's no sample level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased for indel alignment artefacts.

for each lane.bam
    dedup.bam <- MarkDuplicate(lane.bam)
    realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
    recal.bam <- recal(realigned.bam)

Fast + per-sample processing

Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quite. It doesn't perform indel realignment across lanes, so it leaves in some indels artifacts. For humans, which now have an extensive list of indels (get them from the GATK bundle!) the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x) where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations or for organisms with limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.

for each sample
    recals.bam <- merged lane-level recal.bams for sample
    dedup.bam <- MarkDuplicates(recals.bam)
    sample.bam <- dedup.bam

Better: recalibration per lane then per-sample realignment with known indels

As with the basic protocol, this protocol assumes the per-lane processing has been already completed. This protocol is essentially the basic protocol but with per-sample indel realignment.

for each sample
    recals.bam <- merged lane-level recal.bams for sample
    dedup.bam <- MarkDuplicates(recals.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]
    sample.bam <- realigned.bam

This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.

Best: per-sample realignment with known indels then recalibration

Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.

for each sample
    lanes.bam <- merged lane.bams for sample
    dedup.bam <- MarkDuplicates(lanes.bam)
    realigned.bam <- realign(dedup.bam) [with known sites included if available]
    recal.bam <- recal(realigned.bam)
    sample.bam <- recal.bam

This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.

Misc. notes on the process

  • MarkDuplicates needs only be run at the library level. So the sample-level dedupping isn't necessary if you only ever have a library on a single lane. If you run the sample library on many lanes (as can be necessary for whole exome, for example), you should dedup at the library level.
  • The base quality score recalibrator is read group aware, so running it on a merged BAM files containing multiple read groups is the same as running it on each bam file individually. There's some memory cost (so it's best not to recalibrate many read groups simultaneously) but for reasonable projects this is fine.
  • Local realignment preserves read meta-data, so you can realign and then recalibrate just fine.
  • Multi-sample realignment with known sites and recalibration isn't really recommended any longer. It's extremely computational expensive and isn't necessary for advanced callers with advanced filters like the Unified Genotyper / HaplotypeCaller and VQSR. It's better to use one of the protocols above and then an advanced caller that is robust to indel artifacts.
  • However, note that for contrastive calling projects -- such as cancer tumor/normals -- we recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

3. Reducing BAMs to minimize file sizes and improve calling performance

ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReducedReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.

You can find more information about reduced reads in some of our presentations in the archive.

ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't quality as an archive data compression format.

Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReducedReads.

for each sample
    sample.reduced.bam <- ReduceReads(sample.bam)

Phase II: Initial variant discovery and genotyping

1. Input BAMs for variant discovery and genotyping

After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass.

For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.

2. Multi-sample SNP and indel calling

The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Haplotype Caller or Unified Genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.

Selecting an appropriate quality score threshold

A common question is the confidence score threshold to use for variant detection. We recommend:

  • Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.
  • Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Standard protocol: HaplotypeCaller

raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)

Standard protocol: UnifiedGenotyper

raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)

Choosing HaplotypeCaller or UnifiedGenotyper

  • We believe the best possible caller in the GATK is the HaplotypeCaller, which combines a local de novo assembler with a more advanced HMM likelihood function than the UnifiedGenotyper. It should produce excellent SNP, MNP, indel, and short SV calls. It should be the go-to calling algorithm for most projects. It is, for example, how we make our Phase II call set for 1000 Genomes.
  • Currently the HaplotypeCaller only supports diploid calling. If you want to call non-diploid samples you'll need to use the UnifiedGenotyper.
  • At the moment the HaplotypeCaller does not support multithreading. For now you should indeed stick with the UG if you wish to use the -nt option. However you can use Queue to parallelize execution of HaplotypeCaller.
  • If for some reason you cannot use the HaplotyperCaller do fall back to the UnifiedGenotyper protocol below. Otherwise try out the HaplotypeCaller and let us know about your experiences here on the forum!

Phase III: Integrating analyses: getting the best call set possible

This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.

1. Statistical filtering of the raw calls

The process used here is the Variant quality score recalibrator which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.

2. Analysis ready VCF protocol

Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. It is expected that analysis will be required by the user when determining the best way to use this tool for different projects. These recommendations are only meant as jumping off points!

We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:

    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)

3. Notes about small whole exome projects or small target experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline).
  • Use the VQSR with the smaller SNP callset but experiment with the argument settings. For example, try adding --maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters.
  • Use hard filters (detailed below).

Recommendations for very small data sets (in terms of both number of samples or size of targeted regions)

These recommended arguments for VariantFiltration are only to used when ALL other options are not available:

You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values:

For SNPs:

  • QD < 2.0
  • MQ < 40.0
  • FS > 60.0
  • HaplotypeScore > 13.0
  • MQRankSum < -12.5
  • ReadPosRankSum < -8.0

For indels:

  • QD < 2.0
  • ReadPosRankSum < -20.0
  • InbreedingCoeff < -0.8
  • FS > 200.0

Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.

The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.

That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

Tag Importance SAM spec definition Meaning
ID Required Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read groupIDs may be modified when merging SAM files in order to handle collisions. Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.
SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.
PL Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data . It's a good idea to use this field.
LB DNA preparation library identify Essential for MarkDuplicates MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require value for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

9. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.

10. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

1. What file formats do you support for variant callsets?

We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.

2. How can I know if my VCF file is valid?

VCFTools contains a validation tool that will allow you to verify it.

3. Are you planning to include any converters from different formats or allow different input formats than VCF?

No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.

1. What file formats do you support for interval lists?

We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).

2. I have two (or more) sequencing experiments with different target intervals. How can I combine them?

One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).

1. Analysis output overview

In theory, any class implementing the OutputStream interface. In practice, three types of classes are commonly used: PrintStreams for plain text files, SAMFileWriters for BAM files, and VCFWriters for VCF files.

2. PrintStream

To declare a basic PrintStream for output, use the following declaration syntax:

@Output
public PrintStream out;

And use it just as you would any other PrintStream:

out.println("Hello, world!");

By default, @Output streams prepopulate fullName, shortName, required, and doc. required in this context means that the GATK will always fill in the contents of the out field for you. If the user specifies no --out command-line argument, the 'out' field will be prepopulated with a stream pointing to System.out.

If your walker outputs a custom format that requires more than simple concatenation by Queue you should also implement a custom Gatherer.

3. SAMFileWriter

For some applications, you might need to manage their own SAM readers and writers directly from inside your walker. Current best practice for creating these Readers / Writers is to declare arguments of type SAMFileReader or SAMFileWriter as in the following example:

@Output
SAMFileWriter outputBamFile = null;

If you do not specify the full name and short name, the writer will provide system default names for these arguments. Creating a SAMFileWriter in this way will create the type of writer most commonly used by members of the GSA group at the Broad Institute -- it will use the same header as the input BAM and require presorted data. To change either of these attributes, use the StingSAMIterator interface instead:

@Output
StingSAMFileWriter outputBamFile = null;

and later, in initialize(), run one or both of the following methods:

outputBAMFile.writeHeader(customHeader); outputBAMFile.setPresorted(false);

You can change the header or presorted state until the first alignment is written to the file.

4. VCFWriter

VCFWriter outputs behave similarly to PrintStreams and SAMFileWriters. Declare a VCFWriter as follows:

@Output(doc="File to which variants should be written",required=true) protected VCFWriter writer = null;

5. Debugging Output

The walkers provide a protected logger instance. Users can adjust the debug level of the walkers using the -l command line option.

Turning on verbose logging can produce more output than is really necessary. To selectively turn on logging for a class or package, specify a log4j.properties property file from the command line as follows:

-Dlog4j.configuration=file:///<your development root>/Sting/java/config/log4j.properties

An example log4j.properties file is available in the java/config directory of the Git repository.

1. What is Scala?

Scala is a combination of an object oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.

The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there a multitude of resources available around the web including the Scala home page and the online Scala Doc.

2. Where do I learn more about Scala?

  • http://www.scala-lang.org
  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://devcheatsheet.com/tag/scala/
  • http://davetron5000.github.com/scala-style/index.html

3. What is the difference between var and val?

var is a value you can later modify, while val is similar to final in Java.

4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?

Because the GATK and Queue are a mix of Scala and Java sometimes you'll run into problems when you need a Scala collection and instead a Java collection is returned.

   MyQScript.scala:39: error: type mismatch;
     found   : java.util.List[java.lang.String]
     required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)

Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.

import collection.JavaConversions._

Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.

var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)

5. How do I append to a list?

Use the :+ operator for a single value.

  var myList = List.empty[String]
  myList :+= "a"
  myList :+= "b"
  myList :+= "c"

Use ++ for appending a list.

  var myList = List.empty[String]
  myList ++= List("a", "b", "c")

6. How do I add to a set?

Use the + operator.

  var mySet = Set.empty[String]
  mySet += "a"
  mySet += "b"
  mySet += "c"

7. How do I add to a map?

Use the + and -> operators.

  var myMap = Map.empty[String,Int]
  myMap += "a" -> 1
  myMap += "b" -> 2
  myMap += "c" -> 3

8. What are Option, Some, and None?

Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.

  var myNullableInt1: Option[Int] = Some(1)
  var myNullableInt2: Option[Int] = None

9. What is _ / What is the underscore?

François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded

To quote from his slides:

Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it

10. How do I format a String?

Use the .format() method.

This Java snippet:

String formatted = String.format("%s %i", myString, myInt);

In Scala would be:

val formatted = "%s %i".format(myString, myInt)

11. Can I use Scala Enumerations as QScript @Arguments?

No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if for example you are importing a Java based walker's enum type.

If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.

1. Can I use the free IntelliJ IDEA Community Edition to work with Scala and Queue?

Yes. Be sure to install the scala plugin and setup your IDE as listed in [Queue with IntelliJ IDEA(http://gatkforums.broadinstitute.org/discussion/1309/queue-with-intellij-idea).

2. I updated IntelliJ IDEA and lost the ability to use command completion

Check if there is an update to your Scala plugin as well.

3. I can't compile Queue in IntelliJ IDEA / My Scala files are not highlighted correctly

Check your IntelliJ IDEA settings to for the following:

  • The Scala plugin is installed
  • Under File Types have *.scala as a registered pattern for Scala files.

We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.

There are two logins to choose from depending on whether you want to upload or download something:

Downloading

location: ftp.broadinstitute.org
username: gsapubftp-anonymous
password: <blank>

Uploading

location: ftp.broadinstitute.org
username: gsapubftp
password: 5WvQWSfi

The GATK uses two files to access and safety check access to the reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.

Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2%   0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict 
@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:247249719    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ     SN:chr2 LN:242951149    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:b12c7373e3882120332983be99aeb18d
@SQ     SN:chr3 LN:199501827    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ     SN:chr4 LN:191273063    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cf37020337904229dca8401907b626c2
@SQ     SN:chr5 LN:180857866    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:031c851664e31b2c17337fd6f9004858
@SQ     SN:chr6 LN:170899992    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfe8005c536131276d448ead33f1b583
@SQ     SN:chr7 LN:158821424    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:74239c5ceee3b28f0038123d958114cb
@SQ     SN:chr8 LN:146274826    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ     SN:chr9 LN:140273252    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:ea244473e525dde0393d353ef94f974b
@SQ     SN:chr10        LN:135374737    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ     SN:chr11        LN:134452384    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ     SN:chr12        LN:132349534    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d17d70060c56b4578fa570117bf19716
@SQ     SN:chr13        LN:114142980    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c4f3084a20380a373bbbdb9ae30da587
@SQ     SN:chr14        LN:106368585    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ     SN:chr15        LN:100338915    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ     SN:chr16        LN:88827254     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3e81884229e8dc6b7f258169ec8da246
@SQ     SN:chr17        LN:78774742     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2a5c95ed99c5298bb107f313c7044588
@SQ     SN:chr18        LN:76117153     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3d11df432bcdc1407835d5ef2ce62634
@SQ     SN:chr19        LN:63811651     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2f1a59077cfad51df907ac25723bff28
@SQ     SN:chr20        LN:62435964     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ     SN:chr21        LN:46944323     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ     SN:chr22        LN:49691432     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2041e6a0c914b48dd537922cca63acb8
@SQ     SN:chrX LN:154913754    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ     SN:chrY LN:57772954     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ     SN:chr1_random  LN:1663265      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cc05cb1554258add2eb62e88c0746394
@SQ     SN:chr2_random  LN:185571       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ     SN:chr3_random  LN:749256       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cc571e918ac18afa0b2053262cadab6
@SQ     SN:chr4_random  LN:842648       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cab2949ccf26ee0f69a875412c93740
@SQ     SN:chr5_random  LN:143687       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:05926bdbff978d4a0906862eb3f773d0
@SQ     SN:chr6_random  LN:1875562      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d62eb2919ba7b9c1d382c011c5218094
@SQ     SN:chr7_random  LN:549659       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ     SN:chr8_random  LN:943810       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ     SN:chr9_random  LN:1146434      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ     SN:chr10_random LN:113275       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:50be2d2c6720dabeff497ffb53189daa
@SQ     SN:chr11_random LN:215294       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfc93adc30c621d5c83eee3f0d841624
@SQ     SN:chr13_random LN:186858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:563531689f3dbd691331fd6c5730a88b
@SQ     SN:chr15_random LN:784346       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bf885e99940d2d439d83eba791804a48
@SQ     SN:chr16_random LN:105485       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ     SN:chr17_random LN:2617613      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ     SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ     SN:chr19_random LN:301858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:420ce95da035386cc8c63094288c49e2
@SQ     SN:chr21_random LN:1679693      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:a7252115bfe5bb5525f34d039eecd096
@SQ     SN:chr22_random LN:257318       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4f2d259b82f7647d3b668063cf18378b
@SQ     SN:chrX_random  LN:1719168      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f4d71e0758986c15e5455bf3e14e5d6f

Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta 
108.446u 3.384s 2:44.61 67.9%   0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:

> cat Homo_sapiens_assembly18.fasta.fai 
chrM    16571   6       50      51
chr1    247249719       16915   50      51
chr2    242951149       252211635       50      51
chr3    199501827       500021813       50      51
chr4    191273063       703513683       50      51
chr5    180857866       898612214       50      51
chr6    170899992       1083087244      50      51
chr7    158821424       1257405242      50      51
chr8    146274826       1419403101      50      51
chr9    140273252       1568603430      50      51
chr10   135374737       1711682155      50      51
chr11   134452384       1849764394      50      51
chr12   132349534       1986905833      50      51
chr13   114142980       2121902365      50      51
chr14   106368585       2238328212      50      51
chr15   100338915       2346824176      50      51
chr16   88827254        2449169877      50      51
chr17   78774742        2539773684      50      51
chr18   76117153        2620123928      50      51
chr19   63811651        2697763432      50      51
chr20   62435964        2762851324      50      51
chr21   46944323        2826536015      50      51
chr22   49691432        2874419232      50      51
chrX    154913754       2925104499      50      51
chrY    57772954        3083116535      50      51
chr1_random     1663265 3142044962      50      51
chr2_random     185571  3143741506      50      51
chr3_random     749256  3143930802      50      51
chr4_random     842648  3144695057      50      51
chr5_random     143687  3145554571      50      51
chr6_random     1875562 3145701145      50      51
chr7_random     549659  3147614232      50      51
chr8_random     943810  3148174898      50      51
chr9_random     1146434 3149137598      50      51
chr10_random    113275  3150306975      50      51
chr11_random    215294  3150422530      50      51
chr13_random    186858  3150642144      50      51
chr15_random    784346  3150832754      50      51
chr16_random    105485  3151632801      50      51
chr17_random    2617613 3151740410      50      51
chr18_random    4262    3154410390      50      51
chr19_random    301858  3154414752      50      51
chr21_random    1679693 3154722662      50      51
chr22_random    257318  3156435963      50      51
chrX_random     1719168 3156698441      50      51
By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or cutomize the type of notifications you want, you need to do the following: Go to your profile page by clicking on your user name; Click on Edit profile; In the menu on the left, click on Notification Preferences; Select the categories that you want to follow and the type of notification you want to receive. Be sure to click on Save Preferences.

Imagine a simple question like, "What's the depth of coverage at position A of the genome?"

First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.

Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.

This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, which is the reason for the Central Dogma of the GATK:

All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical references sequences.

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.

3. How variation is represented

The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.

  • CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.

  • ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

  • REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

  • QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

  • FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.

For more details about these fields, please see this page.

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

  • chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
  • chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
  • chr1:899282 is a known C/T SNP (named rs28548431), but has a relative low confidence (QUAL = 71.77)
  • chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (for book keeping, really) our statistical evidence is so low that we filter the record out as a bad site, as indicated by the "LowQual" annotation.

4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
  • GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.

  • AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

  • PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.

5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.

chr1    873762  [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1    877664  [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1    899282  [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148

Here are some commonly used built-in annotations and what they mean:

Annotation tag in VCF Meaning
AC,AF,AN See the Technical Documentation for Chromosome Counts.
DB If present, then the variant is in dbSNP.
DP See the Technical Documentation for Coverage.
DS Were any of the samples downsampled because of too much coverage?
Dels See the Technical Documentation for SpanningDeletions.
MQ and MQ0 See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.
BaseQualityRankSumTest See the Technical Documentation for Base Quality Rank Sum Test.
MappingQualityRankSumTest See the Technical Documentation for Mapping Quality Rank Sum Test.
ReadPosRankSumTest See the Technical Documentation for Read Position Rank Sum Test.
HRun See the Technical Documentation for Homopolymer Run.
HaplotypeScore See the Technical Documentation for Haplotype Score.
QD See the Technical Documentation for Qual By Depth.
VQSLOD Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.
FS See the Technical Documentation for Fisher Strand
SB How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).

Objective

Run a basic analysis command on example data, parallelized with Queue.

Prerequisites

Steps

  1. Set up a dry run of Queue
  2. Run the analysis for real
  3. Running on a computing farm

1. Set up a dry run of Queue

One very cool feature of Queue is that you can test your script by doing a "dry run". That means Queue will prepare the analysis and build the scatter commands, but not actually run them. This makes it easier to check the sanity of your script and command.

Here we're going to set up a dry run of a CountReads analysis. You should be familiar with the CountReads walker and the example files from the bundles, as used in the basic "GATK for the first time" tutorial. In addition, we're going to use the example QScript called ExampleCountReads.scala provided in the Queue package download.

Action

Type the following command:

java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam

where -S ExampleCountReads.scala specifies which QScript we want to run, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.

Expected Result

After a few seconds you should see output that looks nearly identical to this:

INFO  00:30:45,527 QScriptManager - Compiling 1 QScript 
INFO  00:30:52,869 QScriptManager - Compilation complete 
INFO  00:30:53,284 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:30:53,284 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21 
INFO  00:30:53,284 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  00:30:53,284 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk 
INFO  00:30:53,285 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  00:30:53,285 HelpFormatter - Date/Time: 2012/08/09 00:30:53 
INFO  00:30:53,285 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:30:53,285 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:30:53,290 QCommandLine - Scripting ExampleCountReads 
INFO  00:30:53,364 QCommandLine - Added 1 functions 
INFO  00:30:53,364 QGraph - Generating graph. 
INFO  00:30:53,388 QGraph - ------- 
INFO  00:30:53,402 QGraph - Pending:  'java'  '-Xmx1024m'  '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp'  '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'CountReads'  '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam'  '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'  
INFO  00:30:53,403 QGraph - Log:     /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads-1.out 
INFO  00:30:53,403 QGraph - Dry run completed successfully! 
INFO  00:30:53,404 QGraph - Re-run with "-run" to execute the functions. 
INFO  00:30:53,409 QCommandLine - Script completed successfully with 1 total jobs 
INFO  00:30:53,410 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt 

If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK and Queue are properly installed.

If you do see this output, congratulations! You just successfully ran you first Queue dry run!

2. Run the analysis for real

Once you have verified that the Queue functions have been generated successfully, you can execute the pipeline by appending -run to the command line.

Action

Instead of this command, which we used earlier:

java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam

this time you type this:

java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run

See the difference?

Result

You should see output that looks nearly identical to this:

INFO  00:56:33,688 QScriptManager - Compiling 1 QScript 
INFO  00:56:39,327 QScriptManager - Compilation complete 
INFO  00:56:39,487 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:56:39,487 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21 
INFO  00:56:39,488 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  00:56:39,488 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk 
INFO  00:56:39,489 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run 
INFO  00:56:39,490 HelpFormatter - Date/Time: 2012/08/09 00:56:39 
INFO  00:56:39,490 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:56:39,491 HelpFormatter - ---------------------------------------------------------------------- 
INFO  00:56:39,498 QCommandLine - Scripting ExampleCountReads 
INFO  00:56:39,569 QCommandLine - Added 1 functions 
INFO  00:56:39,569 QGraph - Generating graph. 
INFO  00:56:39,589 QGraph - Running jobs. 
INFO  00:56:39,623 FunctionEdge - Starting:  'java'  '-Xmx1024m'  '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp'  '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'CountReads'  '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam'  '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'  
INFO  00:56:39,623 FunctionEdge - Output written to /Users/GG/codespace/GATK/Q2/resources/ExampleCountReads-1.out 
INFO  00:56:50,301 QGraph - 0 Pend, 1 Run, 0 Fail, 0 Done 
INFO  00:57:09,827 FunctionEdge - Done:  'java'  '-Xmx1024m'  '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp'  '-cp' '/Users/vdauwera/sandbox/Q2/resources/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'CountReads'  '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam'  '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'  
INFO  00:57:09,828 QGraph - 0 Pend, 0 Run, 0 Fail, 1 Done 
INFO  00:57:09,835 QCommandLine - Script completed successfully with 1 total jobs 
INFO  00:57:09,835 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt 
INFO  00:57:10,107 QCommandLine - Plotting JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.pdf 
WARN  00:57:18,597 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info. 

Great! It works!

The results of the traversal will be written to a file in the current directory. The name of the file will be printed in the output, ExampleCountReads.out in this example.

If for some reason the run was interrupted, in most cases you can resume by just launching the command. Queue will pick up where it left off without redoing the parts that ran successfully.

3. Running on a computing farm

Run with -bsub to run on LSF, or for early Grid Engine support see Queue with Grid Engine.

See also QFunction and Command Line Options for more info on Queue options.

Objective

Run a basic analysis command on example data.

Prerequisites

Steps

  1. Invoke the GATK CountReads command
  2. Further exercises

1. Invoke the GATK CountReads command

A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.

So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai

Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.

For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:

  • -T for the tool name, which specifices the corresponding analysis
  • -R for the reference sequence file
  • -I for the input BAM file of aligned reads

They don't have to be in that order in your command, but this way you can remember that you need them if you TRI...

Expected Result

After a few seconds you should see output that looks like to this:

INFO  16:17:45,945 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:17:45,947 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45 
INFO  16:17:45,947 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,948 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:17:46,060 TraversalEngine -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33 
INFO  16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours 
INFO  16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3 

Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

somewhere in your output.

If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.

If you do see this output, congratulations! You just successfully ran you first GATK analysis!

Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.

Wait, what is this walker thing?

In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.

2. Further Exercises

Now that you're rocking the read counts, you can start to expand your use of the GATK command line.

Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?

Action

Instead of this command, which we used earlier:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

this time you type this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 

See the difference?

Result

You should see something like this output:

INFO  16:18:26,183 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:18:26,185 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:18:26,351 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
2052
INFO  16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3 

Great! But wait -- where's the result? Last time the result was given on this line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

But this time there is no line that says [REDUCE RESULT]! Is something wrong?

Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.

Action

So we repeat the command, but this time we specify an output file, like this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt

where -o (lowercase o, not zero) is used to specify the output.

Result

You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):

INFO  16:29:15,451 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt 
INFO  16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:29:15,618 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3 

This time however, if we look inside the working directory, there is a newly created file there called output.txt.

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r--  1 vdauwera  CHARLES\Domain Users       5 Jul 25 16:29 output.txt

This file contains the result of the analysis:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt 
2052

This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file.

Discussion

Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.

Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check what are the specific arguments that are required, recommended and/or optional for the walker you want to use before starting an analysis.

Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta

the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------

You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.

So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!

There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.

So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!

Objective

Test that the GATK is correctly installed, and that the supporting tools like Java are in your path.

Prerequisites

  • Basic familiarity with the command-line environment
  • Understand what is a PATH variable
  • GATK downloaded and placed on path

Steps

  1. Invoke the GATK usage/help message
  2. Troubleshooting

1. Invoke the GATK usage/help message

The command we're going to run is a very simple command that asks the GATK to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your GATK package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> --help

replacing the <path to GenomeAnalysisTK.jar> bit with the path you have set up in your command-line environment.

Expected Result

You should see usage output similar to the following:

usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L 
        <intervals>] [-R <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H 
        <hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e <err>] [-oe <outerr>] [-A] [-M 
        <maximum_reads>] [-sort <sort_on_the_fly>] [-compress <bam_compression>] [-fmq0] [-dfrac 
        <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S 
        <validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l 
        <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type>                     Type of analysis to run
-I,--input_file <input_file>                           SAM or BAM file(s)
-L,--intervals <intervals>                             A list of genomic intervals over which 
                                                       to operate. Can be explicitly specified 
                                                       on the command line or in a file.
-R,--reference_sequence <reference_sequence>           Reference sequence file
-B,--rodBind <rodBind>                                 Bindings for reference-ordered data, in 
                                                       the form <name>,<type>,<file>
-D,--DBSNP <DBSNP>                                     DBSNP file
-H,--hapmap <hapmap>                                   Hapmap file
-hc,--hapmap_chip <hapmap_chip>                        Hapmap chip file
-o,--out <out>                                         An output file presented to the walker. 
                                                       Will overwrite contents if file exists.
-e,--err <err>                                         An error output file presented to the 
                                                       walker. Will overwrite contents if file 
                                                       exists.
-oe,--outerr <outerr>                                  A joint file for 'normal' and error 
                                                       output presented to the walker. Will 
                                                       overwrite contents if file exists.

...

If you see this message, your GATK installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.

2. Troubleshooting

Let's try to figure out what's not working.

Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version

Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)  

Remedial actions

If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found  

make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.

Objective

Test that Queue is correctly installed, and that the supporting tools like Java are in your path.

Prerequisites

  • Basic familiarity with the command-line environment
  • Understand what is a PATH variable
  • GATK installed
  • Queue downloaded and placed on path

Steps

  1. Invoke the Queue usage/help message
  2. Troubleshooting

1. Invoke the Queue usage/help message

The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

Action

Type the following command:

java -jar <path to Queue.jar> --help

replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.

Expected Result

You should see usage output similar to the following:

usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
       [-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
       <emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
       [-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
       <status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
       [-debug] [-h]

 -S,--script <script>                                                      QScript scala file
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -bsub,--bsub_all_jobs                                                     Use bsub to submit jobs
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
                                                                           http://en.wikipedia.org/wiki/DOT_language
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
                                                                           dot_graph
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -status,--status                                                          Get status of jobs for the qscript
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
                                                                           logging.
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
                                                                           stdout
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.

2. Troubleshooting

Let's try to figure out what's not working.

Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version

Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)  

Remedial actions

If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found  

make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.

1. Naming walkers

Users identify which GATK walker to run by specifying a walker name via the --analysis_type command-line argument. By default, the GATK will derive the walker name from a walker by taking the name of the walker class and removing packaging information from the start of the name, and removing the trailing text Walker from the end of the name, if it exists. For example, the GATK would, by default, assign the name PrintReads to the walker class org.broadinstitute.sting.gatk.walkers.PrintReadsWalker. To override the default walker name, annotate your walker class with @WalkerName("<my name>").

2. Requiring / allowing primary inputs

Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the @Requires / @Allows annotations as follows:

@Requires(DataSource.READS)
@Requires({DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.READS,DataSource.REFERENCE})
@Requires(value=DataSource.REFERENCE})

By default, all parameters are allowed unless you lock them down with the @Allows attribute. The command:

@Allows(value={DataSource.READS,DataSource.REFERENCE})

will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.

Note that as of August 2011, the GATK no longer supports RMD the @Requires and @Allows syntax, as these have moved to the standard @Argument system.

3. Command-line argument tagging

Any command-line argument can be tagged with a comma-separated list of freeform tags.

The syntax for tags is as follows:

-<argument>:<tag1>,<tag2>,<tag3> <argument value>

for example:

-I:tumor <my tumor data>.bam
-eval,VCF yri.trio.chr1.vcf

There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.

Tags can be accessed from within a walker by calling getToolkit().getTags(argumentValue), where argumentValue is the parsed contents of the command-line argument to inspect.

Applications

The GATK currently has comprehensive support for tags on two built-in argument types:

  • -I,--input_file <input_file>

    Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.

From within a walker, use the following code to access the supplied tag or tags:

getToolkit().getReaderIDForRead(read).getTags();
  • Input RODs, e.g. `-V ' or '-eval '

    Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.

4. Adding additional command-line arguments

Users can create command-line arguments for walkers by creating public member variables annotated with @Argument in the walker. The @Argument annotation takes a number of differentparameters:

  • fullName

    The full name of this argument. Defaults to the toLowerCase()’d member name. When specifying fullName on the command line, prefix with a double dash (--).

  • shortName

    The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying shortName on the command line, prefix with a single dash (-).

  • doc

    Documentation for this argument. Will appear in help output when a user either requests help with the –-help (-h) argument or when a user specifies an invalid set of arguments. Documentation is the only argument that is always required.

  • required

    Whether the argument is required when used with this walker. Default is required = true.

  • exclusiveOf

    Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not mutually exclusive of any other arguments.

  • validation

    Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.

By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system, you can add the @Hidden tag below the @Argument annotation, hiding it from the help system but allowing users to supply it on the command-line. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

Passing Command-Line Arguments

Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is −−<arg full name> <value>.

--myint 6

If passing arguments using the short name, the syntax is -<arg short name> <value>. Note that there is a space between the short name and the value:

-m 6

Boolean (class) and boolean (primitive) arguments are a special in that they require no argument. The presence of a boolean indicates true, and its absence indicates false. The following example sets a flag to true.

-B

Supplemental command-line argument annotations

Two additional annotations can influence the behavior of command-line arguments.

  • @Hidden

    Adding this annotation to an @Argument tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.

  • @Deprecated

    Forces the GATK to throw an exception if this argument is supplied on the command-line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.

Examples

Create an required int parameter with full name –myint, short name -m. Pass this argument by adding –myint 6 or -m 6 to the command line.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(doc="my integer")
    public int myInt;

Create an optional float parameter with full name –myFloatingPointArgument, short name -m. Pass this argument by adding –myFloatingPointArgument 2.71 or -m 2.71.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
    public float myFloat;

The GATK will parse the argument differently depending on the type of the public member variable’s type. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. GATK is aware of concrete implementations of some interfaces and abstract classes. If the argument’s member variable is of type List or Set, the GATK will fill the member variable with a concrete ArrayList or TreeSet, respectively. Maps are not currently supported.

5. Additional argument types: @Input, @Output

Besides @Argument, the GATK provides two additional types for command-line arguments: @Input and @Output. These two inputs are very similar to @Argument but act as flags to indicate dataflow to Queue, our pipeline management software.

  • The @Input tag indicates that the contents of the tagged field represents a file that will be read by the walker.

  • The @Output tag indicates that the contents of the tagged field represents a file that will be written by the walker, for consumption by downstream walkers.

We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.

6. Getting access to Reference Ordered Data (RMD) with @Input and RodBinding

As of August 2011, the GATK now provides a clean mechanism for creating walker @Input arguments and using these arguments to access Reference Meta Data provided by the RefMetaDataTracker in the map() call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.

At a very high level, the new RodBindings provide a handle for a walker to obtain the Feature records from Tribble from a map() call, specific to a command line binding provided by the user. This can be as simple as a single ROD file argument|one-to-one binding between a command line argument and a track, or as complex as an argument argument accepting multiple command line arguments, each with a specific name. The RodBindings are generic and type specific, so you can require users to provide files that emit VariantContexts, BedTables, etc, or simply the root type Feature from Tribble. Critically, the RodBindings interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each RodBinding accepted by your walker.

A single ROD file argument

Suppose you have a walker that uses a single track of VariantContexts, such as SelectVariants, in its calculation. You declare a standard GATK-style @Input argument in the walker, of type RodBinding<VariantContext>:

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

This will require the user to provide a command line option --variant:vcf my.vcf to your walker. To get access to your variants, in the map() function you provide the variants variable to the tracker, as in:

Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());

which returns all of the VariantContexts in variants that start at context.getLocation(). See RefMetaDataTracker in the javadocs to see the full range of getter routines.

Note that, as with regular tribble tracks, you have to provide the Tribble type of the file as a tag to the argument (:vcf). The system now checks up front that the corresponding Tribble codec produces Features that are type-compatible with the type of the RodBinding<T>.

RodBindings are generic

The RodBinding class is generic, parameterized as RodBinding<T extends Feature>. This T class describes the type of the Feature required by the walker. The best practice for declaring a RodBinding is to choose the most general Feature type that will allow your walker to work. For example, if all you really care about is whether a Feature overlaps the site in map, you can use Feature itself, which supports this, and will allow any Tribble type to be provided, using a RodBinding<Feature>. If you are manipulating VariantContexts, you should declare a RodBinding<VariantContext>, which will restrict automatically the user to providing Tribble types that can create a object consistent with the VariantContext class (a VariantContext itself or subclass).

Note that in multi-argument RodBindings, as List<RodBinding<T>> arg, the system will require all files provided here to provide an object of type T. So List<RodBinding<VariantContext>> arg requires all -arg command line arguments to bind to files that produce VariantContexts.

An argument that can be provided any number of times

The RodBinding system supports the standard @Argument style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a List<RodBinding<VariantContext>> field:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

With this declaration, your walker will accept any number of -comp arguments, as in:

-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf

For such a command line, the comps field would be initialized to the List with three RodBindings, the first binding to 1.vcf, the second to 2.vcf and finally the third to 3.vcf.

Because this is a required argument, at least one -comp must be provided. Vararg @Input RodBindings can be optional, but you should follow proper varargs style to get the best results.

Proper handling of optional arguments

If you want to make a RodBinding optional, you first need to tell the @Input argument that its options (required=false):

@Input(fullName="discordance", required=false)
private RodBinding<VariantContext> discordanceTrack;

The GATK automagically sets this field to the value of the special static constructor method makeUnbound(Class c) to create a special "unbound" RodBinding here. This unbound object is type safe, can be safely passed to the RefMetaDataTracker get methods, and is guaranteed to never return any values. It also returns false when the isBound() method is called.

An example usage of isBound is to conditionally add header lines, as in:

if ( mask.isBound() ) {
    hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
}

The case for vararg style RodBindings is slightly different. If you want, as above, users to be able to omit the -comp track entirely, you should initialize the value of the collection to the appropriate emptyList/emptySet in Collections:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
public List<RodBinding<VariantContext>> comps = Collections.emptyList();

which will ensure that comps.isEmpty() is true when no -comp is provided.

Implicit and explicit names for RodBindings

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

By default, the getName() method in RodBinding returns the fullName of the @Input. This can be overloaded on the command-line by providing not one but two tags. The first tag is interpreted as the name for the binding, and the second as the type. As in:

-variant:vcf foo.vcf     => getName() == "variant"
-variant:foo,vcf foo.vcf => getName() == "foo"

This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, in VariantEval, there's a List<RodBinding<VariantContext>> comps, which may be dbsnp, hapmap, etc. This would be declared as:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

where a normal command line usage would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

In the code, you might have a loop that looks like:

for ( final RodBinding comp : comps )
    for ( final VariantContext vc : tracker.getValues(comp, context.getLocation())
        out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser.createGenomeLoc(vc)); 

which would print out lines that included things like:

hapmap has a binding at 1:10
omni has a binding at 1:20
hapmap has a binding at 1:30
1000g has a binding at 1:30

This last example begs the question -- what happens with getName() when explicit names are not provided? The system goes out of its way to provide reasonable names for the variables:

  • The first occurrence is named for the fullName, where comp

  • Subsequent occurrences are postfixed with an integer count, starting at 2, so comp2, comp3, etc.

In the above example, the command line

-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf

would emit

comp has a binding at 1:10
comp2 has a binding at 1:20
comp has a binding at 1:30
comp3 has a binding at 1:30

Dynamic type resolution

The new RodBinding system supports a simple form of dynamic type resolution. If the input filetype can be specially associated with a single Tribble type (as VCF can), then you can omit the type entirely from the the command-line binding of a RodBinding!

So whereas a full command line would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

because these are VCF files they could technically be provided as:

-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf

If you don't care about naming, you can now say:

-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf

Best practice for documenting a RodBinding

The best practice is simple: use a javadoc style comment above the @Input annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs @Argument docs. Below is a best practice documentation example from SelectVariants, which accepts a required variant track and two optional discordance and concordance tracks.

public class SelectVariants extends RodWalker<Integer, Integer> {
   /**
     * Variants from this file are sent through the filtering and modifying routines as directed
     * by the arguments to SelectVariants, and finally are emitted.
     */
    @Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
    public RodBinding<VariantContext> variants;

    /**
     * A site is considered discordant if there exists some sample in eval that has a non-reference genotype
     * and either the site isn't present in this track, the sample isn't present in this track,
     * or the sample is called reference in this track.
     */
    @Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> discordanceTrack;

    /**
     * A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
     * in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
     * track and they have the sample genotype call.
     */
    @Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> concordanceTrack;
}

Note how much better the above version is compared to the old pre-Rodbinding syntax (code below). Below you have a required argument variant that doesn't show up as a formal argument in the GATK, different from the conceptually similar @Arguments for discordanceRodName and concordanceRodName, which have no type restrictions. There's no place to document the variant argument as well, so the system is effectively blind to this essential argument.

@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
public class SelectVariants extends RodWalker<Integer, Integer> {
    @Argument(fullName="discordance", shortName =  "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
    private String discordanceRodName = "";

    @Argument(fullName="concordance", shortName =  "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
    private String concordanceRodName = "";
}

RodBinding examples

In these examples, we have declared two RodBindings in the Walker

@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);

@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<VariantContext>();
  • Get the first value

    Feature f = tracker.getFirstValue(mask)

  • Get all of the values at a location

    Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)

  • Get all of the features here, regardless of track

    Collection<Feature> fs = tracker.getValues(Feature.class)

  • Determining if an optional RodBinding was provided . if ( mask.isBound() ) // writes out the mask header line, if one was provided hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

    if ( ! comps.isEmpty() ) logger.info("At least one comp was provided")

Example usage in Queue scripts

In QScripts when you need to tag a file use the class TaggedFile which extends from java.io.File.

Example in the QScript on the Command Line
Untagged VCF myWalker.variant = new File("my.vcf") -V my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF") -V:VCF my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value") -V:VCF,custom=value my.vcf
Labeling a tumor myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor") -I:tumor mytumor.bam

Notes

No longer need to (or can) use @Requires and @Allows for ROD data. This system is now retired.

The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented.

Locus walkers

Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.

1. Switching between covered and uncovered loci

The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:

@By(DataSource.REFERENCE)
@By(DataSource.READS)

2. Filtering defaults

By default, the following filters are automatically added to every locus walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments.
  • Duplicate reads.
  • Reads failing vendor quality checks.

ROD walkers

These walkers walk over the data set one location at a time, but only those locations covered by reference-ordered data. They are essentially a special case of locus walkers. ROD walkers are read-free traversals that include operate over Reference Ordered Data and the reference genome at sites where there is ROD information. They are geared for high-performance traversal of many RODs and the reference such as VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalkers<M,T> traversals with the following few quirks.

1. Differences from a RefWalker

  • RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.

  • Because of this skipping RODWalkers receive a context object where the number of reference skipped bases between map calls is provided:

    nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location

In order to get the final count of skipped bases at the end of an interval (or chromosome) the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.

2. Filtering defaults

ROD walkers inherit the same filters as locus walkers:

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments.
  • Duplicate reads.
  • Reads failing vendor quality checks.

3. Example change over of VariantEval

Changing to a RODWalker is very easy -- here's the new top of VariantEval, changing the system to a RodWalker from its old RefWalker state:

//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {

The map function must now capture the number of skipped bases and protect itself from the final interval map calls:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    nMappedSites += context.getSkippedBases();

    if ( ref == null ) { // we are seeing the last site
        return 0;
    }

    nMappedSites++;

That's all there is to it!

4. Performance improvements

A ROD walker can be very efficient compared to a RefWalker in the situation where you have sparse RODs. Here is a comparison of ROD vs. Ref walker implementation of VariantEval:

RODWalker RefWalker
dbSNP and 1KG Pilot 2 SNP calls on chr1 164u (s) 768u (s)
Just 1KG Pilot 2 SNP calls on chr1 54u (s) 666u (s)

Read walkers

Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.

Filtering defaults

By default, the following filters are automatically added to every read walker.

  • Reads with nonsensical alignments

Read pair walkers

Read pair walkers walk over a queryname-sorted BAM, presenting each mate and its pair. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every read pair walker.

  • Reads with nonsensical alignments

Duplicate walkers

Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every duplicate walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments

1. Introduction

GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.

With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.

2. Obtaining Queue

You have two options: donwload the binary distribution (prepackaged, ready to run program) or build it from source.

  • #### Download the binary

This is obviously the easiest way to go. Links are on the Downloads page.

  • #### Building Queue from source

Briefly, here's what you need to know/do:

Queue is part of the Sting repository. Download the source from our repository on Github. Run the following command:

git clone git://github.com/broadgsa/gatk.git Sting

Use ant to build the source.

cd Sting
ant queue

Queue uses the Ivy dependency manager to fetch all other dependencies. Just make sure you have suitable versions of the JDK and Ant!

See this article on how to test your installation of Queue.

3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help.

java -jar dist/Queue.jar -S script.scala --help

Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.

See QFunction and Command Line Options for more info on adjusting Queue options.

4. QScripts

General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

  • New instances of CommandLineFunctions are created
  • Input and output arguments are specified on each function
  • The function is added with add() to Queue for dispatch and monitoring

The basic command-line to run the Queue pipelines on the command line is

java -jar Queue.jar -S <script>.scala

See the main article Queue QScripts for more info on QScripts.

Supported QScripts

While most QScripts are analysis pipelines that are custom-built for specific projects, some have been released as supported tools. See

Example QScripts

The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples

5. Visualization and Queue

QJobReport

Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile .libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")

Caveats

  • The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves

  • This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.

To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    @Input(doc="foo") var bam = bamIn
    @Input(doc="foo") var bamIndex = bai(bamIn)
    @Output(doc="foo") var recalData = recalDataIn
    memoryLimit = Some(4)
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}

Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

6. Further reading

This article is out of date and has been replaced by updated documents:

- Primer on parallelism

- Specific usage recommendations

The old document, for archival purposes:

1. Overview

The MapReduce architecture of the GATK allows most walkers in the GATK to be run in a parallel-processing mode. The GATK supports two basic parallel processing models known as shared memory and scatter-gather.

  • Shared memory parallelism

    Parallelism within a single multi-threading process with access to a large, shared RAM. Shared memory parallelism is stable and supported by many tools that access pileups of bases.

  • Scatter/gather (SG) parallelism

    In SG parallelism, the target genomic regions are divided up into N independent GATK instances that are run separately on a single machine or across a computing cluster. The outputs of each independent walker, are merged together once all are completed. SG works very efficiently in the GATK, provided the output of a walker is independent per site (e.g. Unified Genotyper) or per chromosome (e.g. Indel Realigner). SG parallelism is a completely stable approach in the GATK, and used routinely by the GATK team in processing large data sets; it is also natively supported by GATK-Queue, which automatically scatters and gathers GATK processes given a desired N number of processes to execute simultaneously.

Note that parallel-processing will significantly speed up data processing but may produce statistically insignificant differences. While this non-determinism is not ideal in practice the minute differences have been mathematically meaningless while producing consistent results in a reasonable amount of time for whole genome and whole exome data. However, if absolute determinism is more important than speed we recommend you do not use parallelism with the GATK.

2. Comparison of GATK parallelism options

There are costs and benefits to each type of parallelism in the GATK, as outlined in the following table.

Comparison of standard parallelism approaches in the GATK

Property Shared Memory Scatter/Gather
Stability Stable Stable | Retired in codebase
Applicable walker types By locus and by ROD only. ReadWalkers are not supported. All walker types. ReadWalkers can only be split safely by chromosome in general
Example walkers UnifiedGenotyper, BaseRecalibrator, VariantEval All walkers, including ReadWalkers like IndelRealigner
Scalability Fewer than 32 cores. Each thread operates completely independently, so N threads requires N times more memory than 1 thread alone. Best scaling at 8 or fewer threads. Hundreds of processes. Limited by capabilities of the underlying storage system, in general. Isilon-class storage can run thousands of jobs effectively.
How to enable Use the -nt argument in the Engine, on any walker that supports shared memory parallelism (the engine will tell you if it does not). 1. Provide -L interval lists to the GATK; a different one for each of the N independent GATK tools. For example -L chr1 for first process, and -L chr2 for the second. When all processes have finished, merge the output together, as appropriate (e.g. use MergeSam.jar for BAMs, and cat or CombineVariants for VCFs). 2. Use GATK-Queue to automatically divide up your GATK jobs. For example, using this.scatterCount = 10 argument will result in 10 independent processes.
Pros - Easy to enable. - Single output file merged together by internally by the GATK engine - Efficiently uses multi-core processors sharing a single memory space - Works for all walker types, including ReadWalkers - Scales to hundreds of independent jobs - Easy to enable with single -L argument - Directly supported and managed by GATK-Queue - Totally independent processing per interval -- failed parts can be easily resumed without repeating already successfully processed regions
Cons - Limited to fewer than 32 processors without significant overhead - Limited to cores physically present on the machine, cannot take advantage of computing cluster resources - Does not work for ReadWalkers (Table Recalibrator, Indel Realigner) - Requires manual preparation of sliced genomic intervals for processing (if you aren't using GATK-Queue). - For ReadWalkers and other tools that can only be processed by chromosome, leading to load balancing problems (chr1 is much bigger than chr22) - Sensitive to data density variation over the genome. Dividing chr20 processing in 63 1MB chunks leads to 10x differences in completion times due to data pileups around the centromere, for example. - Must wait until all parts of the scatter have completed before gathering, making the process sensitive to farm scheduling problems. If a job will run for M minutes, but waits Z minutes to start on the farm, the entire SG process will not complete for at least M + Z minutes.

3. Which parallelism option is right for me?

Almost certainly, either shared memory or scatter/gather parallelism is the right choice for your NGS pipeline. Our go-to option for parallelism here at the Broad Institute is S/G, since S/G allows you to cut up your jobs into hundreds of pieces to run on our standard computing farm, using GATK-Queue. When I have a small job that needs to be run quickly, am testing out program options or need a quick VariantEval result, I'm using shared memory parallelism with ~10 threads on a single large computer with a lot (>64 GB) of memory.

Basically, if I have N processors, and I want to choose between shared memory or S/G, here's how I would choose:

  • If all N processors are on a single computer, and my walker supports it, use shared memory parallelism

  • If not, use S/G

4. Shared Memory Parallelism (Stable)

The GATK currently supports a hierarchical version of parallelism. In this form of parallelism, data is divided into shards, each shard is map/reduced independently, and the results are combined with a 'tree reduce' step. While the framework handles much of the heavy lifting of data division required for parallelism, each walker must individually be reviewed to ensure that it isn't tracking state internally in a non-threadsafe way. Many tools support shared memory parallelism, including critical tools such as:

  • UnifiedGenotyper
  • CountCovariates
  • VariantEval

Please review the source to discover if your walker is parallelizable, or attempt to run it with -nt 2 and if it the engine doesn't complain the walker is parallelized.

In shared memory parallelism, each thread operates on a 16 kbp shard of reference sequence in a completely independent memory space from all other threads. Up to 24 threads can run efficiently in this design, but greater parallelism is limited by resource starvation from the single reduce thread and/or memory inefficiency by keeping each thread’s data totally independent. See gatkParallelism performance 082112 these plots for an analysis of the scalability of several key GATK tools as a function of nt.

Enabling n-way parallelism from the command-line

Run the GATK, using the -nt command-line argument to specify the number of threads that the GATK should attempt to use.

[[image:HierarchicalParallelism.png|thumb|Shared memory parallelism architecture]]

Helpful hints: Implementing a walker with parallelism in mind

First, be aware that some walkers may, by design, require a rewrite for complete parallelization.

  • When implementing a standard (non-parallelized) walker, one must implement the reduce method, which combines an individual datum returned by the map function with the aggregate of the prior map calls. When implementing a parallelizable walker, one must also implement the org.broadinstitute.sting.gatk.walkers.TreeReducible interface and the treeReduce() function. The TreeReduce function tells the GATK how to combine two adjacent reduces, rather than one map result and one reduce.

  • The GATK supports writing to output files from either the map or the reduce when running in parallel. However, only unbuffered writers are currently supported. Please use PrintStream rather than PrintWriter at this time.

Limitations

The GATK's support for parallelism is currently limited. The following classes of walkers are not supported by our parallelization framework:

  • Read walkers
  • Reduce-by-interval walkers

Note that each thread operates completely independently in the current GATK implementation of shared memory parallelism. So if you provide 1Gb to the GATK with nt 1, then you should provide 4Gb to run with nt 4. If you don't do this, it is possible to starve out the GATK so that it runs much much slower.

The performance of the multi-threaded GATK is really dependent on whether you are IO or CPU limited and the relative overhead of the parallelism on your computer. Additionally, nt can start to have very high overheads with nt > 20 due to our implementation and memory contention issues.

The best option for nt is a value less or equal to the number of available cores with sufficient memory to run each threads (nt x amount provided to 1 core), capped additionally by the available IO bandwidth so that the individual threads don't starve each other.

5. Scatter / gather parallelism

Scatter / gather is a simple approach for process-independent parallelism with the GATK. First you scatter multiple independent GATK instances out over a network of computing nodes, each operating on slightly different genomic intervals, and when they all complete, the output of each is gathered up into a merged output dataset. In the GATK S/G is extremely easy to use, as all of the GATK tools support the -L option to operate over only genomic specific intervals, and many tools emit files that can be merged together across multiple regions. Unified Genotyper, for example, can operate over the whole genome in one process, or on each chromosome independently. The output of this later approach, after merging together, should be the same as the whole genome results, minus any slight differences due to random number effects. The simplicity and efficiency of S/G parallelism makes this a key option for getting things done quickly with the GATK.

Using S/G parallelism is extremely easy, either by hand or using the automated Scatter/Gather in GATK-Queue. Suppose I have the following command line:

java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1

This runs a single process of the GATK over chr1, and let's say it takes an hour when I run it. In order to run it with S/G parallelism mode, just partition chr1 into two independent parts:

This file distributed.tracker.txt will contain genomic locations and GATK process ids that are processing each location, in text format, so you can cat it. If you run at the command line:

gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:1-125,000,000 -o my.1.vcf &
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:125,000,001-249,250,621 -o my.2.vcf &

When these two jobs finish, I just merge the two VCFs together and I've got a complete data set in half the time.

1. Introduction

As mentioned in the introductory materials, the core concept behind the GATK tools is the walker. The Queue scripting framework contains several mechanisms which make it easy to chain together GATK walkers.

2. Authoring walkers

As part of authoring your walker there are several Queue behaviors that you can specify for QScript authors using your particular walker.

Specifying how to partition

Queue can significantly speed up generating walker outputs by passing different instances of the GATK the same BAM or VCF data but specifying different regions of the data to analyze. After the different instances output their individual results Queue will gather the results back to the original output path requested by QScript.

Queue limits the level it will split genomic data by examining the @PartitionBy() annotation for your walker which specifies a PartitionType. This table lists the different partition types along with the default partition level for each of the different walker types.

PartitionType Default for Walker Type Description Example Intervals Example Splits
PartitionType.CONTIG Read walkers Data is grouped together so that all genomic data from the same contig is never presented to two different instances of the GATK. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60; split 2:chr3:10-11
PartitionType.INTERVAL (none) Data is split down to the interval level but never divides up an explicitly specified interval. If no explicit intervals are specified in the QScript for the GATK then this is effectively the same as splitting by contig. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40; split 2: chr2:50-60, chr3:10-11
PartitionType.LOCUS Locus walkers, ROD walkers Data is split down to the locus level possibly dividing up intervals. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-35; split 2: chr2:36-40, chr2:50-60, chr3:10-11
PartitionType.NONE Read pair walkers, Duplicate walkers The data cannot be split and Queue must run the single instance of the GATK as specified in the QScript. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 no split: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11

If you walker is implemented in a way that Queue should not divide up your data you should explicitly set the @PartitionBy(PartitionType.NONE). If your walker can theoretically be run per genome location specify @PartitionBy(PartitionType.LOCUS).

@PartitionBy(PartitionType.LOCUS)
public class ExampleWalker extends LocusWalker<Integer, Integer> {
...

Specifying how to join outputs

Queue will join the standard walker outputs.

Output type Default gatherer implementation
SAMFileWriter The BAM files are joined together using Picard's MergeSamFiles.
VCFWriter The VCF files are joined together using the GATK CombineVariants.
PrintStream The first two files are scanned for a common header. The header is written once into the output, and then each file is appended to the output, skipping past with the header lines.

If your PrintStream is not a simple text file that can be concatenated together, you must implement a Gatherer. Extend your custom Gatherer from the abstract base class and implement the gather() method.

package org.broadinstitute.sting.commandline;

import java.io.File;
import java.util.List;

/**
 * Combines a list of files into a single output.
 */
public abstract class Gatherer {
    /**
     * Gathers a list of files into a single output.
     * @param inputs Files to combine.
     * @param output Path to output file.
     */
    public abstract void gather(List<File> inputs, File output);

    /**
     * Returns true if the caller should wait for the input files to propagate over NFS before running gather().
     */
    public boolean waitForInputs() { return true; }
}

Specify your gatherer using the @Gather() annotation by your @Output.

@Output
@Gather(MyGatherer.class)
public PrintStream out;

Queue will run your custom gatherer to join the intermediate outputs together.

3. Using GATK walkers in Queue

Queue GATK Extensions

Running 'ant queue' builds a set of Queue extensions for the GATK-Engine. Every GATK walker and command line program in the compiled GenomeAnalysisTK.jar a Queue compatible wrapper is generated.

The extensions can be imported via import org.broadinstitute.sting.queue.extensions.gatk._

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class MyQscript extends QScript {
...

Note that the generated GATK extensions will automatically handle shell-escaping of all values assigned to the various Walker parameters, so you can rest assured that all of your values will be taken literally by the shell. Do not attempt to escape values yourself -- ie.,

Do this:

filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")

NOT this:

filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")

Listing variables

In addition to the GATK documentation on this wiki you can also find the full list of arguments for each walker extension in a variety of ways.

The source code for the extensions is generated during ant queue and placed in this directory:

build/queue-extensions/src

When properly configured an IDE can provide command completion of the walker extensions. See Queue with IntelliJ IDEA for our recommended settings.

If you do not have access to an IDE you can still find the names of the generated variables using the command line. The generated variable names on each extension are based off of the fullName of the Walker argument. To see the built in documentation for each Walker, run the GATK with:

java -jar GenomeAnalysisTK.jar -T <walker name> -help

Once the import statement is specified you can add() instances of gatk extensions in your QScript's script() method.

Setting variables

If the GATK walker input allows more than one of a value you should specify the values as a List().

  def script() {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

Although it may be harder for others trying to read your QScript, for each of the long name arguments the extensions contain aliases to their short names as well.

  def script() {
    val snps = new UnifiedGenotyper
    snps.R = new File("testdata/exampleFASTA.fasta")
    snps.I = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

Here are a few more examples using various list assignment operators.

  def script() {
    val countCovariates = new CountCovariates

    // Append to list using item appender :+
    countCovariates.rodBind :+= RodBind("dbsnp", "VCF", dbSNP)

    // Append to list using collection appender ++
    countCovariates.covariate ++= List("ReadGroupCovariate", "QualityScoreCovariate", "CycleCovariate", "DinucCovariate")

    // Assign list using plain old object assignment
    countCovariates.input_file = List(inBam)

    // The following is not a list, so just assigning one file to another
    countCovariates.recal_file = outRecalFile

    add(countCovariates)
  }

Specifying an alternate GATK jar

By default Queue runs the GATK from the current classpath. This works best since the extensions are generated and compiled at time same time the GATK is compiled via ant queue.

If you need to swap in a different version of the GATK you may not be able to use the generated extensions. The alternate GATK jar must have the same command line arguments as the GATK compiled with Queue. Otherwise the arguments will not match and you will get an error when Queue attempts to run the alternate GATK jar. In this case you will have to create your own custom CommandLineFunction for your analysis.

  def script {
    val snps = new UnifiedGenotyper
    snps.jarFile = new File("myPatchedGATK.jar")
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

GATK scatter/gather

Queue currently allows QScript authors to explicitly invoke scatter/gather on GATK walkers by setting the scatter count on a function.

  def script {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    snps.scatterCount = 20
    add(snps)
  }

This will run the UnifiedGenotyper up to 20 ways parallel and then will merge the partial VCFs back into the single snps.vcf.

Additional caveat

Some walkers are still being updated to support Queue fully. For example they may not have defined the @Input and @Output and thus Queue is unable to correctly track their dependencies, or a custom Gatherer may not be implemented yet.

1. Basic QScript run rules

  • In the script method, a QScript will add one or more CommandLineFunctions.
  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.
  • Queue will run functions based on the dependencies between them, so if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.

2. Command Line

Each CommandLineFunction must define the actual command line to run as follows.

class MyCommandLine extends CommandLineFunction {
  def commandLine = "myScript.sh hello world"
}

Constructing a Command Line Manually

If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class.

For example:

def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)

Constructing a Command Line using API Methods

If you're writing a CommandLineFunction that will become part of Queue and/or will be used by other QScripts, however, our best practice recommendation is to construct your command line only using the methods provided in the CommandLineFunction class: required(), optional(), conditional(), and repeat()

The reason for this is that these methods automatically escape the values you give them so that they'll be interpreted literally within the shell scripts Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you. This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:

override def commandLine = super.commandLine +
                           required("eff") +
                           conditional(verbose, "-v") +
                           optional("-c", config) +
                           required("-i", "vcf") +
                           required("-o", "vcf") +
                           required(genomeVersion) +
                           required(inVcf) +
                           required(">", escape=false) +  // This will be shell-interpreted as an output redirection
                           required(outVcf)

The CommandLineFunctions built into Queue, including the CommandLineFunctions automatically generated for GATK Walkers, are all written using this pattern. This means that when you configure a GATK Walker or one of the other built-in CommandLineFunctions in a QScript, you can rely on all of your values being safely escaped and taken literally when the commands are run, including values containing characters that would normally be interpreted by the shell such as MQ > 10.

Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:

  • required()

Used for command-line arguments that are always present, e.g.:

required("-f", "filename")                              returns: " '-f' 'filename' "
required("-f", "filename", escape=false)                returns: " -f filename "
required("java")                                        returns: " 'java' "
required("INPUT=", "myBam.bam", spaceSeparated=false)   returns: " 'INPUT=myBam.bam' "
  • optional()

Used for command-line arguments that may or may not be present, e.g.:

optional("-f", myVar) behaves like required() if myVar has a value, but returns ""
if myVar is null/Nil/None
  • conditional()

Used for command-line arguments that should only be included if some condition is true, e.g.:

conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""
  • repeat()

Used for command-line arguments that are repeated multiple times on the command line, e.g.:

repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "

3. Arguments

  • CommandLineFunction arguments use a similar syntax to arguments.

  • CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.

Input and Output Files

So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output must be java.io.File objects.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: File = _
  def commandLine = "myScript.sh -fileParam " + inputFile
}

FileProvider

CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="named input file")
  var inputFile: ExampleFileProvider = _
  def commandLine = "myScript.sh " + inputFile
}

// An example FileProvider that stores a 'name' with a 'file'.
class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
  override def toString = " -fileName " + name + " -fileParam " + file
}

Optional Arguments

Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file", required=false)
  var inputFile: File = _
  // -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)
}

Collections as Arguments

A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
  // -fileParam will added as many times as the QScript adds the inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)
}

Non-File Arguments

A command line function can define other required arguments via @Argument.

class MyCommandLine extends CommandLineFunction {
  @Argument(doc="message to display")
  var veryImportantMessage: String = _
  // If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
  def commandLine = required("myScript.sh") + required(veryImportantMessage)
}

4. Example: "samtools index"

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = "samtools index %s %s".format(bamFile, baiFile)
)

Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)
)

1. Introduction

Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts. Check out the following as references.

  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://davetron5000.github.com/scala-style/index.html

QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.

The following is a basic outline of a QScript:

import org.broadinstitute.sting.queue.QScript
// List other imports here

// Define the overall QScript here.
class MyScript extends QScript {
  // List script arguments here.
  @Input(doc="My QScript inputs")
  var scriptInput: File = _

  // Create and add the functions in the script here.
  def script = {
     var myCL = new MyCommandLine
     myCL.myInput = scriptInput // Example variable input
     myCL.myOutput = new File("/path/to/output") // Example hardcoded output
     add(myCL)
  }

}

2. Imports

Imports can be any scala or java imports in scala syntax.

import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.

3. Classes

  • To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.

  • The QScript must define a method script.

  • The QScript can define helper methods or variables.

4. Script method

The body of script should create and add Queue CommandlineFunctions.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}

5. Command Line Arguments

  • A QScript canbe set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.

  • A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.

  • QScript command line arguments can be marked as optional by setting required=false.

    class MyScript extends org.broadinstitute.sting.queue.QScript { @Input(doc="example message to echo") var message: String = _ def script = add(new CommandLineFunction { def commandLine = "echo " + message }) }

6. Using and writing CommandLineFunctions

Adding existing GATK walkers

See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.

After functions are defined they should be added to the QScript pipeline using add().

for (vcf <- vcfs) {
  val ve = new VariantEval
  ve.vcfFile = vcf
  ve.evalFile = swapExt(vcf, "vcf", "eval")
  add(ve)
}

Defining new CommandLineFunctions

  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.

  • Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.

  • See the main article Queue CommandLineFunctions for more information.

7. Examples

  • The latest version of the example files are available in the Sting git repository under public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/.

  • To print the list of arguments required by an existing QScript run with -help.

  • To check if your script has all of the CommandLineFunction variables set correctly, run without -run.
  • When you are ready to execute the full pipeline, add -run.

Hello World QScript

The following is a "hello world" example that runs a single command line to echo hello world.

import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script = {
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}

The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run

It should produce output similar to:

INFO  16:23:27,825 QScriptManager - Compiling 1 QScript 
INFO  16:23:31,289 QScriptManager - Compilation complete 
INFO  16:23:34,631 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine 
INFO  16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run  
INFO  16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,634 QCommandLine - Scripting HelloWorld 
INFO  16:23:34,651 QCommandLine - Added 1 functions 
INFO  16:23:34,651 QGraph - Generating graph. 
INFO  16:23:34,660 QGraph - Running jobs. 
INFO  16:23:34,689 ShellJobRunner - Starting: echo hello world 
INFO  16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out 
INFO  16:23:34,771 ShellJobRunner - Done: echo hello world 
INFO  16:23:34,773 QGraph - Deleting intermediate files. 
INFO  16:23:34,773 QCommandLine - Done 

ExampleUnifiedGenotyper.scala

This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.

The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.

To list the command line parameters, including the required parameters, run with -help.

java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help

The help output should appear similar to this:

INFO  10:26:08,491 QScriptManager - Compiling 1 QScript
INFO  10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
Program Name: org.broadinstitute.sting.queue.QCommandLine
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
       [-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
       <dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
       <job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
       [-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
       [-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
       [-filter <filternames>] [-filterExpression <filterexpressions>]

 -S,--script <script>                                                      QScript scala file
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -jobRunner,--job_runner <job_runner>                                      Use the specified job runner to dispatch
                                                                           command line jobs
 -bsub,--bsub                                                              Equivalent to -jobRunner Lsf706
 -status,--status                                                          Get status of jobs for the qscript
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
                                                                           http://en.wikipedia.org/wiki/DOT_language
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
                                                                           dot_graph
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobPriority,--job_priority <job_priority>                                Default priority for jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
                                                                           logging.
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
                                                                           stdout
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

Arguments for ExampleUnifiedGenotyper:
 -R,--referencefile <referencefile>                          The reference file for the bam files.
 -I,--bamfile <bamfile>                                      Bam file to genotype.
 -L,--intervals <intervals>                                  An optional file with a list of intervals to proccess.
 -filter,--filternames <filternames>                         A optional list of filter names.
 -filterExpression,--filterexpressions <filterexpressions>   An optional list of filter expressions.


##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.commandline.MissingArgumentException:
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------

To dry run the pipeline:

java \
  -Djava.io.tmpdir=tmp \
  -jar dist/Queue.jar \
  -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
  -R human_b36_both.fasta \
  -I pilot2_daughters.chr20.10k-11k.bam \
  -L chr20.interval_list \
  -filter StrandBias -filterExpression "SB>=0.10" \
  -filter AlleleBalance -filterExpression "AB>=0.75" \
  -filter QualByDepth -filterExpression "QD<5" \
  -filter HomopolymerRun -filterExpression "HRun>=4"

The dry run output should appear similar to this:

INFO  10:45:00,354 QScriptManager - Compiling 1 QScript
INFO  10:45:04,855 QScriptManager - Compilation complete
INFO  10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO  10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4 
INFO  10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO  10:45:05,150 QCommandLine - Added 4 functions
INFO  10:45:05,150 QGraph - Generating graph.
INFO  10:45:05,169 QGraph - Generating scatter gather jobs.
INFO  10:45:05,182 QGraph - Removing original jobs.
INFO  10:45:05,183 QGraph - Adding scatter gather jobs.
INFO  10:45:05,231 QGraph - Regenerating graph.
INFO  10:45:05,247 QGraph - -------
INFO  10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO  10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,254 QGraph - -------
INFO  10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,279 QGraph - -------
INFO  10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,283 QGraph - -------
INFO  10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - -------
INFO  10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,289 QGraph - -------
INFO  10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO  10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,292 QGraph - -------
INFO  10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO  10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO  10:45:05,296 QGraph - -------
INFO  10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO  10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO  10:45:05,302 QGraph - -------
INFO  10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO  10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO  10:45:05,304 QGraph - Dry run completed successfully!
INFO  10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO  10:45:05,304 QCommandLine - Done

8. Using traits to pass common values between QScripts to CommandLineFunctions

QScript files often create multiple CommandLineFunctions with similar arguments. Use various scala tricks such as inner classes, traits / mixins, etc. to reuse variables.

  • A self type can be useful to distinguish between this. We use qscript as an alias for the QScript's this to distinguish from the this inside of inner classes or traits.

  • A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and then is mixed into different instances of the functions.

See the following example:

class MyScript extends org.broadinstitute.sting.queue.QScript {
  // Create an alias 'qscript' for 'MyScript.this'
  qscript =>

  // This is a script argument
  @Argument(doc="message to display")
  var message: String = _

  // This is a script argument
  @Argument(doc="number of times to display")
  var count: Int = _

  trait ReusableArguments extends MyCommandLineFunction {
    // Whenever a function is created 'with' this trait, it will copy the message.
    this.commandLineMessage = qscript.message
  }

  abstract class MyCommandLineFunction extends CommandLineFunction {
     // This is a per command line argument
     @Argument(doc="message to display")
     var commandLineMessage: String = _
  }

  class MyEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo " + commandLineMessage
  }

  class MyAlsoEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo also " + commandLineMessage
  }

  def script = {
    for (i <- 1 to count) {
      val echo = new MyEchoFunction with ReusableArguments
      val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
      add(echo, alsoEcho)
    }
  }
}

VariantRecalibrator

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

   -percentBad 0.01 -minNumBad 1000 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \
   -mode SNP \

Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

Important notes about annotations

Some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller variant callset but experiment with the precise argument settings (try adding --maxGaussians 4 --percentBad 0.05 to your command line, for example)

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle. Arguments for VariantRecalibrator:

   --maxGaussians 4 -percentBad 0.01 -minNumBad 1000 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an DP -an FS -an ReadPosRankSum -an MQRankSum \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

Common, base command line

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode INDEL \

1. Reference Sequence

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.

Human sequence

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.

Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.

2. Sequencing Reads

The only input format for NGS reads that the GATK supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.

In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:

  • The file must be binary (with .bam file extension).
  • The file must be indexed.
  • The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
  • The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
  • Each read in the file must be associated with exactly one read group.

Below is an example well-formed SAM field header and fields from the 1000 Genomes Project:

@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:249250621    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:2    LN:243199373    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:3    LN:198022430    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:fdfd811849cc2fadebc929bb925902e5
@SQ     SN:4    LN:191154276    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:23dccd106897542ad87d2765d28a19a1
@SQ     SN:5    LN:180915260    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:0740173db9ffd264d728f32784845cd7
@SQ     SN:6    LN:171115067    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1d3a93a248d92a729ee764823acbbc6b
@SQ     SN:7    LN:159138663    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:618366e953d6aaad97dbe4777c29375e
@SQ     SN:8    LN:146364022    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:96f514a9929e410c6651697bded59aec
@SQ     SN:9    LN:141213431    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:3e273117f15e0a400f01055d9f393768
@SQ     SN:10   LN:135534747    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:988c28e000e84c26d552359af1ea2e1d
@SQ     SN:11   LN:135006516    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ     SN:12   LN:133851895    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:51851ac0e1a115847ad36449b0015864
@SQ     SN:13   LN:115169878    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:283f8d7892baa81b510a015719ca7b0b
@SQ     SN:14   LN:107349540    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:98f3cae32b2a2e9524bc19813927542e
@SQ     SN:15   LN:102531392    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:e5645a794a8238215b2cd77acb95a078
@SQ     SN:16   LN:90354753     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ     SN:17   LN:81195210     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ     SN:18   LN:78077248     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ     SN:19   LN:59128983     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1aacd71f30db8e561810913e0b72636d
@SQ     SN:20   LN:63025520     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ     SN:21   LN:48129895     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ     SN:22   LN:51304566     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a718acaa6135fdca8357d5bfe94211dd
@SQ     SN:X    LN:155270560    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ     SN:Y    LN:59373566     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1fa3474750af0948bdf97d5a0ee52e51
@SQ     SN:MT   LN:16569        AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:c68f52674c9fb33aef52dcf399755519
@RG     ID:ERR000162    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR000252    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001684    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001685    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001686    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001687    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001688    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001689    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001690    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002307    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002308    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002309    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002310    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002311    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002312    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002313    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002434    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@PG     ID:GATK TableRecalibration      VN:v2.2.16      CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau 
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:b4eb71ee878d3706246b7c1dbef69299
@PG     ID:bwa  VN:0.5.5
ERR001685.4315085       16      1       9997    25      35M     *       0       0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@?     XT:A:U  XN:i:4    X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001685  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834       117     1       9997    0       *       =       9997    0       CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@     RG:Z:ERR001689    OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834       185     1       9997    25      35M     =       9997    0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     758A:?>>8?=@@>>?;4<>=??@@==??@?==?8     XT:A:U  XN:i:4    SM:i:25 AM:i:0  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001689  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347       117     1       9998    0       *       =       9998    0       CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     5@BA@A6B???A?B??>B@B??>B@B??>BAB???     RG:Z:ERR001688    OQ:Z:=>>>><4><<?><??????????????????????       

Fixing BAM files with alternative sortings

The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.

3. Intervals

The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files have a .interval_list extension and look like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
@SQ     SN:3    LN:198022430    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:fdfd811849cc2fadebc929bb925902e5     SP:Homo Sapiens
@SQ     SN:4    LN:191154276    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:23dccd106897542ad87d2765d28a19a1     SP:Homo Sapiens
@SQ     SN:5    LN:180915260    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:0740173db9ffd264d728f32784845cd7     SP:Homo Sapiens
@SQ     SN:6    LN:171115067    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1d3a93a248d92a729ee764823acbbc6b     SP:Homo Sapiens
@SQ     SN:7    LN:159138663    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:618366e953d6aaad97dbe4777c29375e     SP:Homo Sapiens
@SQ     SN:8    LN:146364022    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:96f514a9929e410c6651697bded59aec     SP:Homo Sapiens
@SQ     SN:9    LN:141213431    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:3e273117f15e0a400f01055d9f393768     SP:Homo Sapiens
@SQ     SN:10   LN:135534747    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:988c28e000e84c26d552359af1ea2e1d     SP:Homo Sapiens
@SQ     SN:11   LN:135006516    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:98c59049a2df285c76ffb1c6db8f8b96     SP:Homo Sapiens
@SQ     SN:12   LN:133851895    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:51851ac0e1a115847ad36449b0015864     SP:Homo Sapiens
@SQ     SN:13   LN:115169878    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:283f8d7892baa81b510a015719ca7b0b     SP:Homo Sapiens
@SQ     SN:14   LN:107349540    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:98f3cae32b2a2e9524bc19813927542e     SP:Homo Sapiens
@SQ     SN:15   LN:102531392    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:e5645a794a8238215b2cd77acb95a078     SP:Homo Sapiens
@SQ     SN:16   LN:90354753     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:fc9b1a7b42b97a864f56b348b06095e6     SP:Homo Sapiens
@SQ     SN:17   LN:81195210     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:351f64d4f4f9ddd45b35336ad97aa6de     SP:Homo Sapiens
@SQ     SN:18   LN:78077248     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c     SP:Homo Sapiens
@SQ     SN:19   LN:59128983     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1aacd71f30db8e561810913e0b72636d     SP:Homo Sapiens
@SQ     SN:20   LN:63025520     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:0dec9660ec1efaaf33281c0d5ea2560f     SP:Homo Sapiens
@SQ     SN:21   LN:48129895     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:2979a6085bfe28e3ad6f552f361ed74d     SP:Homo Sapiens
@SQ     SN:22   LN:51304566     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a718acaa6135fdca8357d5bfe94211dd     SP:Homo Sapiens
@SQ     SN:X    LN:155270560    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:7e0e2e580297b7764e31dbc80c2540dd     SP:Homo Sapiens
@SQ     SN:Y    LN:59373566     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1fa3474750af0948bdf97d5a0ee52e51     SP:Homo Sapiens
@SQ     SN:MT   LN:16569        AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:c68f52674c9fb33aef52dcf399755519     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6
...

consisting of a SAM-file-like sequence dictionary (the header), and targets in the form of + . These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the .dict extension) and your intervals into one file.

You can also specify a list of intervals in a .interval_list file formatted as :- (one interval per line). No sequence dictionary is necessary. This file uses 1-based coordinates.

Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.

4. Reference Ordered Data (ROD) file formats

The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:

-argumentName:name,type file

Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.

The GATK supports several common file formats for reading ROD data:

  • VCF : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; VCFTools provides the official VCF validator. See [here] for a useful poster detailing the VCF specification.
  • UCSC formated dbSNP : dbSNP type, UCSC dbSNP database output
  • BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs. Please note that the bed format is 0-based while most other formats are 1-based.

Note that we no longer support the PED format. See here for converting .ped files to VCF.

You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).

But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?

To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.

First you need to understand what are the two core components of the GATK: the engine and tools (see picture below).

As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK was a car, that would be the engine and chassis. What we call the **tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.

Core GATK components

Second is how we work on developing the GATK, and what it means for how improvements are shared (or not) between Lite and Full.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.

The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the purposes of power, speed and robustness of the GATK that is determined by the engine, there is no difference between them.

For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.

So there are two basic types of differences between the tools available in the Lite and Full builds (see picture below).

  1. We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.

  2. We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Tools in Lite vs. Full

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:

  1. The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.

  2. Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.

So, to summarize:

The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.

We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!

Overview

One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.

Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.

Map/Reduce

Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.

  • filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.

  • map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.

  • reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.

This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.

Walkers, filters and traversal types

All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.

Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.

Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.

There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.

The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.

Further reading

A primer on parallelism with the GATK How can I use parallelism to make GATK tools run faster?

A GATKReport is simply a text document that contains well-formatted, easy to read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.

Here's a simple example:

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7                                         
0      7.451835696110506E-3      25.474613284804366                                      
1      2.362777171937477E-3      29.844949954504095                                      
2      9.087604507451836E-4      32.875909752547310
3      5.452562704471102E-4      34.498999090081895                                      
4      9.087604507451836E-4      35.148316651501370                                       
5      5.452562704471102E-4      36.072234352256190                                       
6      5.452562704471102E-4      36.121724890829700                                        
7      5.452562704471102E-4      36.191048034934500                                        
8      5.452562704471102E-4      36.003457059679770                                       

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T 
1:1001  A 
1:1002  C 

This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.

We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are the five simple steps to getting gsalib, installing it and loading a report.

1. Get the GATK source code on GitHub

Please visit the Downloads page for instructions.

2. Compile the gsalib library

$ ant gsalib
Buildfile: build.xml

gsalib:
     [exec] * installing *source* package ?gsalib? ...
     [exec] ** R
     [exec] ** data
     [exec] ** preparing package for lazy loading
     [exec] ** help
     [exec] *** installing help indices
     [exec] ** building package indices ...
     [exec] ** testing if installed package can be loaded
     [exec] 
     [exec] * DONE (gsalib)

BUILD SUCCESSFUL

3. Tell R where to find the gsalib library by adding the path in your ~/.Rprofile (you may need to create this file if it doesn't exist)

$ cat .Rprofile 
.libPaths("/path/to/Sting/R/")

4. Start R and load the gsalib library

$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(gsalib)

5. Finally, load the GATKReport file and have fun

> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list

1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current

with a subdirectory containing for each reference sequence and associated data files.

External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).

2. b37 Resources: the Standard Data Set

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:
    • The most recent dbSNP release
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs
  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam containing ~64x reads of NA12878 on chromosome 20
    • The results of the latest UnifiedGenotyper with default arguments run on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)

Additionally, these files all have supplementary indices, statistics, and other QC data available.

3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.

Also includes a chain file to lift over to b37.

4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formated reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.

Also includes a chain file to lift over to b37.

5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.

The following links should be help as a review or an introduction to concepts and terminology related to next-generation sequencing:

1. Introduction

The core concept behind GATK tools is the walker, a class that implements the three core operations: filtering, mapping, and reducing.

  • filter Reduces the size of the dataset by applying a predicate.

  • map Applies a function to each individual element in a dataset, effectively mapping it to a new element.

  • reduce Inductively combines the elements of a list. The base case is supplied by the reduceInit() function, and the inductive step is performed by the reduce() function.

Users of the GATK will provide a walker to run their analyses. The engine will produce a result by first filtering the dataset, running a map operation, and finally reducing the map operation to a single result.

2. Creating a Walker

To be usable by the GATK, the walker must satisfy the following properties:

  • It must subclass one of the basic walkers in the org.broadinstitute.sting.gatk.walkers package, usually ReadWalker or LociWalker.

    • Locus walkers present all the reads, reference bases, and reference-ordered data that overlap a single base in the reference. Locus walkers are best used for analyses that look at each locus independently, such as genotyping.

    • Read walkers present only one read at a time, as well as the reference bases and reference-ordered data that overlap that read.

    • Besides read walkers and locus walkers, the GATK features several other data access patterns, described here.

  • The compiled class or jar must be on the current classpath. The Java classpath can be controlled using either the $CLASSPATH environment variable or the JVM's -cp option.

3. Examples

The best way to get started with the GATK is to explore the walkers we've written. Here are the best walkers to look at when getting started:

  • CountLoci

    It is the simplest locus walker in our codebase. It counts the number of loci walked over in a single run of the GATK.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLociWalker.java

  • CountReads

    It is the simplest read walker in our codebase. It counts the number of reads walked over in a single run of the GATK.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadsWalker.java

  • GATKPaperGenotyper

    This is a more sophisticated example, taken from our recent paper in Genome Research (and using our ReadBackedPileup to select and filter reads). It is an extremely basic Bayesian genotyper that demonstrates how to output data to a stream and execute simple base operations.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/examples/papergenotyper/GATKPaperGenotyper.java

Please note that the walker above is NOT the UnifiedGenotyper. While conceptually similar to the UnifiedGenotyper, the GATKPaperGenotyper uses a much simpler calling model for increased clarity and readability.

4. External walkers and the 'external' directory

The GATK can absorb external walkers placed in a directory of your choosing. By default, that directory is called 'external' and is relative to the Sting git root directory (for example, ~/src/Sting/external). However, you can choose to place that directory anywhere on the filesystem and specify its complete path using the ant external.dir property.

ant -Dexternal.dir=~/src/external

The GATK will check each directory under the external directory (but not the external directory itself!) for small build scripts. These build scripts must contain at least a compile target that compiles your walker and places the resulting class file into the GATK's class file output directory. The following is a sample compile target:

<target name="compile" depends="init">
    <javac srcdir="." destdir="${build.dir}" classpath="${gatk.classpath}" />
</target>

As a convenience, the build.dir ant property will be predefined to be the GATK's class file output directory and the gatk.classpath property will be predefined to be the GATK's core classpath. Once this structure is defined, any invocation of the ant build scripts will build the contents of the external directory as well as the GATK itself.

1. Install scala somewhere

At the Broad, we typically put it somewhere like this:

/home/radon01/depristo/work/local/scala-2.7.5.final

Next, create a symlink from this directory to trunk/scala/installation:

ln -s /home/radon01/depristo/work/local/scala-2.7.5.final trunk/scala/installation

2. Setting up your path

Right now the only way to get scala walkers into the GATK is by explicitly setting your CLASSPATH in your .my.cshrc file:

setenv CLASSPATH /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/FourBaseRecaller.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GenomeAnalysisTK.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/Playground.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/StingUtils.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/bcel-5.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/colt-1.2.0.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/google-collections-0.9.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/javassist-3.7.ga.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/junit-4.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/log4j-1.2.15.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-1.02.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-private-875.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/reflections-0.9.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/sam-1.01.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/simple-xml-2.0.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

Really this needs to be manually updated whenever any of the libraries are updated. If you see this error:

Caused by: java.lang.RuntimeException: java.util.zip.ZipException: error in opening zip file
        at org.reflections.util.VirtualFile.iterable(VirtualFile.java:79)
        at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:169)
        at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:167)
        at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:43)
        at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:41)
        at org.reflections.util.FluentIterable$ForkIterator.computeNext(FluentIterable.java:81)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.util.FluentIterable$FilterIterator.computeNext(FluentIterable.java:102)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.util.FluentIterable$TransformIterator.computeNext(FluentIterable.java:124)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.Reflections.scan(Reflections.java:69)
        at org.reflections.Reflections.<init>(Reflections.java:47)
        at org.broadinstitute.sting.utils.PackageUtils.<clinit>(PackageUtils.java:23)

It's because the libraries aren't updated. Basically just do an ls of your trunk/dist directory after the GATK has been build, make this your classpath as above, and tack on:

/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

A command that almost works (but you'll need to replace the spaces with colons) is:

#setenv CLASSPATH $CLASSPATH `ls /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/*.jar` /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

3. Building scala code

All of the Scala source code lives in scala/src, which you build using ant scala

There are already some example Scala walkers in scala/src, so doing a standard checkout, installing scala, settting up your environment, should allow you to run something like:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ant scala
Buildfile: build.xml

init.scala:

scala:
     [echo] Sting: Compiling scala!
   [scalac] Compiling 2 source files to /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/scala/classes
   [scalac] warning: there were deprecation warnings; re-run with -deprecation for details
   [scalac] one warning found
   [scalac] Compile suceeded with 1 warning; see the compiler output for details.
   [delete] Deleting: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
      [jar] Building jar: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar

4. Invoking a scala walker

Until we can include Scala walkers along with the main GATK jar (avoiding the classpath issue too) you have to invoke your scala walkers using this syntax:

java -Xmx2048m org.broadinstitute.sting.gatk.CommandLineGATK -T BaseTransitionTableCalculator -R /broad/1KG/reference/human_b36_both.fasta -I /broad/1KG/DCC_merged/freeze5/NA12878.pilot2.SLX.bam -l INFO -L 1:1-100

Here, the BaseTransitionTableCalculator walker is written in Scala and being loaded into the system by the GATK walker manager. Otherwise everything looks like a normal GATK module.

Sorry, there are no publicly available documents of this type with the tag #basic. Try one of the other types.
Sorry, there are no publicly available documents of this type with the tag #basic. Try one of the other types.