GATK Best Practices
Recommended workflows for variant analysis with GATK



About the GATK Best Practices

The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.


Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware that some issues are currently still without a good solution.




Frequently Asked Questions


Created 2013-08-02 19:12:19 | Updated 2013-08-02 19:12:40 | Tags: official terms

There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.

  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.

  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.

  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

Note that many GATK commands can be run at the lane level, but will give better results when they can see all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.


Created 2013-08-02 20:23:38 | Updated 2013-09-07 14:04:11 | Tags: official best-practices workflow

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.
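
For illustration, here is a rough sketch of what that per-lane chain might look like using BWA, Picard and GATK version 3-style commands. The file names (ref.fasta, mills_indels.vcf, dbsnp.vcf) and the jar locations are placeholders, and the arguments shown are not exhaustive -- check each tool's documentation for the full set of recommended options.

# map one lane and sort by coordinate (read group information is attached at this stage)
bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:LIB-SAMPLE1-1\tSM:sample1' ref.fasta sample1_lane1.fq > sample1_lane1.sam
java -jar picard.jar SortSam INPUT=sample1_lane1.sam OUTPUT=sample1_lane1.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true

# mark duplicates within the lane
java -jar picard.jar MarkDuplicates INPUT=sample1_lane1.sorted.bam OUTPUT=sample1_lane1.dedup.bam METRICS_FILE=sample1_lane1.dedup.metrics CREATE_INDEX=true

# local realignment around indels
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I sample1_lane1.dedup.bam -known mills_indels.vcf -o sample1_lane1.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I sample1_lane1.dedup.bam -targetIntervals sample1_lane1.intervals -known mills_indels.vcf -o sample1_lane1.dedup.realn.bam

# base quality score recalibration
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample1_lane1.dedup.realn.bam -knownSites dbsnp.vcf -o sample1_lane1.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I sample1_lane1.dedup.realn.bam -BQSR sample1_lane1.recal.table -o sample1_lane1.dedup.realn.recal.bam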

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.
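
For example, assuming Picard is available (samtools merge would work equally well; picard.jar is a placeholder for your local Picard jar), the merge for sample1 might look like this:

# merge the per-lane BAMs for sample1 into a single per-sample BAM
java -jar picard.jar MergeSamFiles INPUT=sample1_lane1.dedup.realn.recal.bam INPUT=sample1_lane2.dedup.realn.recal.bam OUTPUT=sample1.merged.bam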

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.


Created 2012-08-11 03:43:49 | Updated 2013-03-05 17:58:44 | Tags: official faq basic analyst bam

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
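
As a minimal sketch (file names and read group values are placeholders, and you should check the aligner's documentation for recommended options), mapping a pair of FASTQs with BWA and converting the output to a sorted, indexed BAM with Picard might look like this:

# align reads, attaching read group information at mapping time
bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:LIB-SAMPLE1-1\tPU:FLOWCELL1.LANE1\tSM:sample1' ref.fasta sample1_lane1_R1.fq sample1_lane1_R2.fq > sample1_lane1.sam
# convert to a coordinate-sorted, indexed BAM
java -jar picard.jar SortSam INPUT=sample1_lane1.sam OUTPUT=sample1_lane1.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true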

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use Picard's ReorderSam command to make a BAM file's SQ order match another reference sequence.
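
For example (my.bam and picard.jar are placeholders):

java -jar picard.jar SortSam INPUT=my.bam OUTPUT=my.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true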

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicates the presence of read groups. Each read group has an SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

  • ID (Required)
    SAM spec definition: Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in the header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
    Meaning: Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. It will be referenced by each read with the RG:Z tag, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.

  • SM (Required; as important as ID)
    SAM spec definition: Sample. Use pool name where a pool is being sequenced.
    Meaning: The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.

  • PL (Important; not currently used in the GATK, but was in the past, and may return)
    SAM spec definition: Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
    Meaning: The only way to know the sequencing technology used to generate the data, so it's a good idea to fill in this field.

  • LB (Essential for MarkDuplicates)
    SAM spec definition: DNA preparation library identifier.
    Meaning: MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require values for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship from read groups (one per lane) to libraries (each sequenced on two lanes) to samples (each spread across four lanes, two lanes per library).

10. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.
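
For example, to tag every read in a BAM with a single read group (the values shown are taken from the trio example above; adjust them to describe your own data, and note that the tool asks for RGPU even though the GATK itself ignores PU):

java -jar picard.jar AddOrReplaceReadGroups INPUT=my.bam OUTPUT=my.rg.bam RGID=FLOWCELL1.LANE1 RGPL=ILLUMINA RGLB=LIB-DAD-1 RGPU=FLOWCELL1.LANE1 RGSM=DAD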

11. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.
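
For example (my.bam and picard.jar are placeholders):

java -jar picard.jar ValidateSamFile INPUT=my.bam MODE=SUMMARY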

12. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following (reference.fasta is a placeholder for the reference the BAM was aligned to):

java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fasta -I full.bam -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping that interval. This operation retains the complete BAM header from the full file (including the reference it was aligned to) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.


Created 2012-08-06 17:28:04 | Updated 2015-07-29 11:53:14 | Tags: vcf annotation

This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health (GA4GH) Data Working Group file formats team. The full format spec can be found in the samtools/hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.
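
For example, a quick way to look up the definition of the DP genotype annotation in an uncompressed VCF is to grep the header (my.vcf is a placeholder; use zgrep for a bgzipped file):

$ grep '^##FORMAT=<ID=DP' my.vcf
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">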

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!



4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant. If a record at this contig and position exists in a reference database such as dbSNP, its identifier (e.g. an rsID) is typically given here; otherwise this field contains a dot.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ;, as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate

In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. For non-diploid organisms, the same pattern applies: in the haploid case there will be just a single value in GT, and for polyploids there will be more, e.g. 4 values for a tetraploid organism.

  • AD and DP : Allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller's filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level: the total number of reads at the site that passed the variant caller's filters, regardless of which allele they support. You can check the variant caller's documentation to see which filters are applied by default. Only reads that passed the variant caller's filters are included in this number; however, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

  • PL : Normalized Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is.

  • GQ : Quality of the assigned genotype. The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.
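
For example, a hedged sketch of a GATK 3-style invocation that pulls a few site-level fields and per-sample genotype fields into a simple tab-delimited table (file names are placeholders):

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V my.vcf -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -GF GT -GF GQ -o my.table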

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


Do you have a question on this topic that wasn't addressed anywhere here? Please ask it here in the forum.

About the DNAseq Best Practices

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

The workflow is divided into three main sections that are meant to be performed sequentially:

  • Pre-processing: from raw DNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)
  • Variant discovery: from reads (BAM files) to variants (VCF files)
  • Refinement and evaluation: genotype refinement, functional annotation and callset QC

Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make them suitable for variant calling analysis. The steps involved are Mapping and Marking Duplicates, Local Realignment Around Indels, and Base Quality Score Recalibration (BQSR), performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.
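
As an illustrative sketch (file names are placeholders; mapping and coordinate sorting are assumed to have been done as in the FASTQ-to-BAM example earlier in this document), the duplicate marking step might look like this with Picard:

java -jar picard.jar MarkDuplicates INPUT=sample1_lane1.sorted.bam OUTPUT=sample1_lane1.dedup.bam METRICS_FILE=sample1_lane1.dedup.metrics CREATE_INDEX=true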

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.
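
A hedged GATK 3-style sketch of the two realignment steps (ref.fasta and mills_indels.vcf are placeholders for your reference and known-indels resource):

# step 1: identify intervals that need to be realigned
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I sample1_lane1.dedup.bam -known mills_indels.vcf -o sample1_lane1.intervals
# step 2: perform the actual realignment over those intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I sample1_lane1.dedup.bam -targetIntervals sample1_lane1.intervals -known mills_indels.vcf -o sample1_lane1.realn.bam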

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.
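
A hedged GATK 3-style sketch of the two recalibration steps (dbsnp.vcf and mills_indels.vcf stand in for your known-variants resources):

# step 1: build the covariation model from the data and the known variant sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample1_lane1.realn.bam -knownSites dbsnp.vcf -knownSites mills_indels.vcf -o sample1_lane1.recal.table
# step 2: write out a new BAM with adjusted base quality scores
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I sample1_lane1.realn.bam -BQSR sample1_lane1.recal.table -o sample1_lane1.recal.bam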


Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.
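
For illustration only (sample BAM and gVCF names are placeholders), the per-sample calling and optional batching steps might look like this with GATK 3-style commands:

# call one sample in GVCF mode
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample1.bam --emitRefConfidence GVCF -o sample1.g.vcf
# optionally combine many single-sample gVCFs into a batch
java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fasta --variant sample1.g.vcf --variant sample2.g.vcf -o batch1.g.vcf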

Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.
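
A minimal sketch, assuming a GATK 3-style invocation and placeholder file names:

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta --variant batch1.g.vcf --variant batch2.g.vcf -o raw_variants.vcf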

Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set (the modeling step). We can then use this variant quality score in a second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
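
For example, a hedged GATK 3-style sketch of recalibrating and filtering SNPs (the resource files, annotations and tranche threshold shown here are placeholders rather than recommendations; indels are handled in a separate pass with -mode INDEL):

# step 1: build the recalibration model
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input raw_variants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf -an QD -an MQ -an FS -mode SNP -recalFile snps.recal -tranchesFile snps.tranches
# step 2: apply the model to filter the raw calls at a chosen sensitivity threshold
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input raw_variants.vcf -mode SNP -recalFile snps.recal -tranchesFile snps.tranches --ts_filter_level 99.5 -o recalibrated_snps.vcf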


Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.




Frequently Asked Questions


Created 2013-08-02 19:12:19 | Updated 2013-08-02 19:12:40 | Tags: official terms
Comments (0)

There are four major organizational units for next-generation DNA sequencing processes that used throughout the GATK documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.

  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.

  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.

  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand what is the best way to group or partition your data for that particular process.


Created 2013-08-02 20:23:38 | Updated 2013-09-07 14:04:11 | Tags: official best-practices workflow
Comments (2)

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.


Created 2012-08-11 03:43:49 | Updated 2013-03-05 17:58:44 | Tags: official faq basic analyst bam
Comments (1)

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

  • ID
    Importance: Required.
    SAM spec definition: Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in the header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
    Meaning: Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Each read references its read group with the RG:Z tag, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.

  • SM
    Importance: Required. As important as ID.
    SAM spec definition: Sample. Use pool name where a pool is being sequenced.
    Meaning: The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.

  • PL
    Importance: Important. Not currently used in the GATK, but was in the past, and may return.
    SAM spec definition: Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
    Meaning: The only way to know the sequencing technology used to generate the sequencing data. It's a good idea to fill in this field.

  • LB
    Importance: Essential for MarkDuplicates.
    SAM spec definition: DNA preparation library identifier.
    Meaning: MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require values for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
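
These @RG strings are easiest to get right at alignment time. With bwa mem, for example, the read group for a given lane can be passed in with -R; here is a sketch for the first of Dad's lanes (the FASTQ and reference file names are placeholders, and the ID, LB, SM and PI values should be adjusted for each lane):

bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:LIB-DAD-1\tSM:DAD\tPI:200' reference.fasta dad_lane1_1.fq dad_lane1_2.fq > dad_lane1.sam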

10. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.
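
For example, an invocation along these lines (the read group values shown are placeholders; substitute the real lane, library, platform and sample information for your data):

java -jar picard.jar AddOrReplaceReadGroups \
    I=my.bam \
    O=my.rg.bam \
    RGID=FLOWCELL1.LANE1 \
    RGLB=LIB-1 \
    RGPL=ILLUMINA \
    RGPU=FLOWCELL1.LANE1 \
    RGSM=sample1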

11. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.
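
For example, a quick check that summarizes any problems found might look like this (omit MODE=SUMMARY to get the default verbose report that lists each problematic record):

java -jar picard.jar ValidateSamFile I=my.bam MODE=SUMMARY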

12. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fasta -I full.bam -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those positions. This operation retains the complete BAM header from the full file (including the sequence dictionary for the reference the reads were aligned to) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.


Created 2012-08-06 17:28:04 | Updated 2015-07-29 11:53:14 | Tags: vcf annotation
Comments (42)

This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.
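
For example, if you want to check how GQ is defined in a particular file, you can pull the relevant header line with a standard UNIX command (assuming an uncompressed VCF; pipe through zcat first if it is gzipped):

grep '^##FORMAT=<ID=GQ,' my.vcf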

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate

    In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
  • AD and DP : Allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level. This gives you the total number of reads at this position that passed the variant caller’s filters (you can check the variant caller’s documentation to see which filters are applied by default). However, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

  • PL : Normalized Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is. [todo: PL details doc]

  • GQ : Quality of the assigned genotype. The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.
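
As a sketch of what that looks like in practice, the following command extracts a few site-level fields plus the per-sample genotype and genotype quality into a tab-delimited table (the fields and file names shown are just examples; request whichever INFO and FORMAT annotations you actually need):

java -jar GenomeAnalysisTK.jar \
    -T VariantsToTable \
    -R reference.fasta \
    -V input.vcf \
    -F CHROM -F POS -F REF -F ALT -F QUAL -F FILTER \
    -GF GT -GF GQ \
    -o results.table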

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


Do you have a question on this topic that wasn't addressed anywhere here? Please ask it here in the forum.

About the RNAseq Best Practices

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In the future we will provide cohort analysis recommendations, but these are not yet available.

The workflow is divided in three main sections that are meant to be performed sequentially:

  • Pre-processing: from raw RNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)
  • Variant discovery: from reads (BAM files) to variants (VCF files)
  • Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. These adaptations are described in the sections below.


Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make them suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split'N'Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR), performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using the STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.
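
As a sketch of the duplicate-marking step, e.g. with Picard's MarkDuplicates (assuming the STAR output has already been sorted by coordinate and carries read group information; file names are placeholders):

java -jar picard.jar MarkDuplicates \
    I=star_aligned.sorted.bam \
    O=dedupped.bam \
    M=duplicate_metrics.txt \
    CREATE_INDEX=true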

Split'N'Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.
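
In GATK 3.x this step is performed by the SplitNCigarReads tool; a sketch of the command, including the mapping-quality reassignment described above (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T SplitNCigarReads \
    -R reference.fasta \
    -I dedupped.bam \
    -o split.bam \
    -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
    -U ALLOW_N_CIGAR_READS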

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.
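
If you choose to include this optional step, the two sub-steps map onto the RealignerTargetCreator and IndelRealigner tools; a minimal sketch (known-indel resources can be supplied with -known if available for your organism; file names are placeholders):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I split.bam -o realign_targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I split.bam -targetIntervals realign_targets.intervals -o realigned.bam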

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.
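
The two recalibration steps correspond to the BaseRecalibrator and PrintReads tools in GATK 3.x; a minimal sketch, assuming a known-sites resource such as dbSNP is available for your organism (file names are placeholders):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta -I realigned.bam -knownSites dbsnp.vcf -o recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fasta -I realigned.bam -BQSR recal_data.table -o recal.bam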


Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.
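
A sketch of such a per-sample call (the -dontUseSoftClippedBases and -stand_call_conf settings shown are commonly used with RNAseq data in GATK 3.x; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.recal.bam \
    -dontUseSoftClippedBases \
    -stand_call_conf 20.0 \
    -o sample1.raw.vcf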

Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with filtering information that can then be used in downstream analyses.
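
As an illustration, such hard filters are typically applied with the VariantFiltration tool; the FS and QD thresholds shown here are example starting points and can be adjusted for your data (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fasta \
    -V sample1.raw.vcf \
    -window 35 -cluster 3 \
    -filterName FS -filter "FS > 30.0" \
    -filterName QD -filter "QD < 2.0" \
    -o sample1.filtered.vcf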


Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.




Frequently Asked Questions


Created 2013-08-02 19:12:19 | Updated 2013-08-02 19:12:40 | Tags: official terms
Comments (0)

There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.

  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.

  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.

  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples), we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.


Created 2013-08-02 20:23:38 | Updated 2013-09-07 14:04:11 | Tags: official best-practices workflow
Comments (2)

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.
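
One straightforward way to do the merge is with Picard's MergeSamFiles, for example (shown here for sample1):

java -jar picard.jar MergeSamFiles \
    I=sample1_lane1.dedup.realn.recal.bam \
    I=sample1_lane2.dedup.realn.recal.bam \
    O=sample1.merged.bam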

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.


Created 2012-08-11 03:43:49 | Updated 2013-03-05 17:58:44 | Tags: official faq basic analyst bam
Comments (1)

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file's @SQ order match another reference sequence.
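
For example (file names are placeholders):

java -jar picard.jar SortSam I=my_unsorted.bam O=my_sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true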

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicates the presence of read groups. Each read group has an SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

  • ID
    Importance: Required.
    SAM spec definition: Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in the header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
    Meaning: Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Each read references its read group with the RG:Z tag, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.

  • SM
    Importance: Required. As important as ID.
    SAM spec definition: Sample. Use pool name where a pool is being sequenced.
    Meaning: The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.

  • PL
    Importance: Important. Not currently used in the GATK, but was in the past, and may return.
    SAM spec definition: Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
    Meaning: The only way to know the sequencing technology used to generate the sequencing data. It's a good idea to fill in this field.

  • LB
    Importance: Essential for MarkDuplicates.
    SAM spec definition: DNA preparation library identifier.
    Meaning: MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require values for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

10. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.

11. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

12. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fasta -I full.bam -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those positions. This operation retains the complete BAM header from the full file (including the sequence dictionary for the reference the reads were aligned to) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.


Created 2012-08-06 17:28:04 | Updated 2015-07-29 11:53:14 | Tags: vcf annotation
Comments (42)

This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ;, as in this example: MQ=99.00;MQ0=0;QD=17.94 (see the sketch below). They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
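As a quick illustration of the two conventions just described -- the Phred scale used by QUAL and the tag=value encoding used by INFO -- here is a minimal Python sketch. The INFO string is the placeholder example from above, and the helper function name is ours, not part of any GATK tool.

# Phred scale: QUAL = -10 * log10(P_error), so P_error = 10 ** (-QUAL / 10)
def phred_to_error_probability(qual):
    return 10 ** (-qual / 10.0)

print(phred_to_error_probability(10))    # 0.1   -> 1 in 10 chance the call is wrong
print(phred_to_error_probability(100))   # 1e-10 -> 1 in 10^10 chance

# INFO annotations are tag=value pairs separated by semicolons
# (this simple split assumes every annotation has a value, i.e. no flag-type annotations)
info = "MQ=99.00;MQ0=0;QD=17.94"
annotations = dict(pair.split("=", 1) for pair in info.split(";"))
print(annotations["QD"])                 # '17.94'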

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.
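For example, here is a minimal Python sketch (illustration only) showing how the FORMAT keys line up with the values in a sample column, using the NA12878 column of the first example record shown in the next section.

# Pair the FORMAT keys with the corresponding values from one sample column
format_column = "GT:AD:DP:GQ:PL"
na12878_column = "0/1:173,141:282:99:255,0,255"

sample_annotations = dict(zip(format_column.split(":"), na12878_column.split(":")))
print(sample_annotations)
# {'GT': '0/1', 'AD': '173,141', 'DP': '282', 'GQ': '99', 'PL': '255,0,255'}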


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the most common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate

    In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. For non-diploids the same pattern applies: in the haploid case there will be just a single value in GT, and for polyploids there will be more, e.g. 4 values for a tetraploid organism.

  • AD and DP : Allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller's filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level: the total number of reads at this site that passed the variant caller's internal filters (you can check the variant caller's documentation to see which filters are applied by default). However, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

  • PL : Normalized Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is. [todo: PL details doc]

  • GQ : Quality of the assigned genotype. The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
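Putting the GT and GQ definitions together, here is a minimal Python sketch (illustration only, assuming a simple diploid GT string and a PL list already normalized so that its minimum is 0) that decodes the GT indices into alleles and derives GQ as the second-smallest PL capped at 99, using the 1:899282 record from the example above.

# GT indices point into the list [REF] + ALT alleles
ref, alts = "C", ["T"]
gt = "0/1"
pl = [103, 0, 26]                            # PLs for genotypes 0/0, 0/1, 1/1

alleles = [ref] + alts
called_alleles = [alleles[int(i)] for i in gt.split("/")]
print(called_alleles)                        # ['C', 'T']

# GQ = second-smallest PL (the difference from the best PL, which is 0), capped at 99
gq = min(99, sorted(pl)[1])
print(gq)                                    # 26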

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (= had the reference base) and 3 of which were ALT (= had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned genotype, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
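If you prefer to see those PL values as probabilities rather than Phred-scaled numbers, the conversion quoted above is simply P = 10^(-PL/10), as in this quick Python sketch (illustration only):

pl = {"0/0": 103, "0/1": 0, "1/1": 26}
probabilities = {genotype: 10 ** (-value / 10.0) for genotype, value in pl.items()}
print(probabilities)
# {'0/0': ~5e-11 (i.e. 10^-10.3), '0/1': 1.0, '1/1': ~0.0025 (i.e. 10^-2.6)}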


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


Do you have a question on this topic that wasn't addressed anywhere here? Please ask it here in the forum.