VCF

VCF stands for Variant Call Format. It is used by the 1000 Genomes Project to encode genetic variants, including structural variants.

  • Variant calls include SNPs, indels, and genomic rearrangements.
  • Samples may also be annotated with attribute information, including pedigree and family information. IGV uses these annotations to group, sort, and filter samples, for example to group samples by population.

A consistent color scheme is used in the variant display row (the top row) for files with or without genotypes.

  • blue - minor allele frequency/fraction is known from annotation or genotype data
  • grey - minor allele frequency is not known
  • red - height is proportional to minor allele frequency

Required Extensions: .vcf, .vcf.gz

If the file is gzipped (ends with .vcf.gz), it must have an accompanying Tabix index (see below).

VCF Requirements

IGV supports VCF Version 4.

VCF data files must be indexed for viewing in IGV, either by using igvtools or by using Tabix. 

  • igvtools can be run from the command line or from IGV itself (Tools>Run igvtools...). After launching, choose the Index command and browse to your .vcf file. The index file (.idx) will be created in the same directory as the .vcf file.
    • igvtools also sorts .vcf files.
  • Tabix creates a .tbi file. Tabix, including documentation, is available from the SAMtools website.
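
Before loading a file, it can help to check that its index is where IGV will look for it. The following is a minimal sketch rather than part of IGV: the function names are hypothetical, and it assumes the usual naming in which igvtools writes <name>.vcf.idx and Tabix writes <name>.vcf.gz.tbi next to the data file.

```python
from pathlib import Path


def expected_index_path(vcf_path):
    """Return the index file expected next to a VCF file.

    Assumption: igvtools names its index <name>.vcf.idx and
    Tabix names its index <name>.vcf.gz.tbi.
    """
    name = str(vcf_path)
    if name.endswith(".vcf.gz"):
        return Path(name + ".tbi")
    if name.endswith(".vcf"):
        return Path(name + ".idx")
    raise ValueError(f"not a .vcf or .vcf.gz file: {name}")


def has_index(vcf_path):
    """Check whether the expected index file exists on disk."""
    return expected_index_path(vcf_path).exists()
```

For example, has_index("calls.vcf.gz") is true only if calls.vcf.gz.tbi exists in the same directory.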

Load a BAM track for a sample in a VCF file

To display the reads behind a variant genotype, associate BAM files with the samples in the VCF file.

The association is defined in a two-column, tab-delimited mapping file:

  • The filename must be <vcf file name>.mapping. In other words, add .mapping to the end of the VCF file name.
  • The first column is the sample name from the VCF file; the second is the path to the BAM file. The BAM file path can be a URL or a file path, and it can be either absolute or relative to the path of the VCF file.
  • If the mapping file is present, it is loaded automatically, and a new menu item called "load alignments" appears in the VCF track.
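
The mapping file is simple enough to generate with a script. The sketch below writes one under the rules listed above; write_mapping_file is a hypothetical helper, not an IGV or igvtools function, and the sample names and BAM paths in the usage note are placeholders.

```python
from pathlib import Path


def write_mapping_file(vcf_path, sample_to_bam):
    """Write '<vcf file name>.mapping' next to the VCF and return its path.

    sample_to_bam maps each sample name (as it appears in the VCF
    header) to a BAM path or URL.
    """
    mapping_path = Path(str(vcf_path) + ".mapping")
    with open(mapping_path, "w", encoding="utf-8", newline="\n") as handle:
        for sample, bam in sample_to_bam.items():
            # Two columns separated by a single tab: sample name, BAM path.
            handle.write(f"{sample}\t{bam}\n")
    return mapping_path
```

Calling write_mapping_file("calls.vcf", {"NA00001": "bams/NA00001.bam"}) would create calls.vcf.mapping with one tab-delimited line. The relative BAM path is resolved against the location of the VCF file, as described above.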

VCF Specification

The version 4.0 spec: http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0

Example version 4.0 file:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

This example shows in order:

  • A good, simple SNP
  • A possible SNP that has been filtered out because its quality is below 10
  • A site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error)
  • A site that is called monomorphic reference (i.e., with no alternate alleles)
  • A microsatellite with two alternative alleles, one a deletion of 3 bases (TCT), and the other an insertion of one base (A).
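
The allele descriptions in this list can be checked from the REF and ALT columns alone. The following sketch uses a hypothetical helper, describe_allele, that classifies an ALT allele by comparing its length with REF. This is a deliberate simplification that suits the simple cases in the example file, not a general variant classifier.

```python
def describe_allele(ref, alt):
    """Classify an ALT allele relative to REF by comparing lengths.

    Simplification: only allele lengths are compared, so complex
    substitutions are not broken down any further.
    """
    if alt == ".":
        # A '.' in the ALT column means no alternate allele was called.
        return "no alternate allele"
    diff = len(alt) - len(ref)
    if diff == 0:
        return "SNP" if len(ref) == 1 else "substitution"
    kind = "insertion" if diff > 0 else "deletion"
    return f"{kind} of {abs(diff)} base(s)"


# The microsatellite record: REF=GTCT, ALT=G,GTACT
for alt in "G,GTACT".split(","):
    print(alt, "->", describe_allele("GTCT", alt))
```

For the microsatellite record this reports a deletion of 3 bases for G and an insertion of 1 base for GTACT, matching the description above.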

Genotype data are given for three samples: two are phased and the third is unphased. Per-sample genotype quality, depth, and haplotype qualities (the latter only for the phased samples) are given along with the genotypes. The microsatellite calls are unphased.
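
As an illustration of how the per-sample columns are read, the sketch below pairs the FORMAT keys with each sample's values for the first record of the example. In the GT field, "|" separates phased alleles and "/" separates unphased alleles. parse_sample is a hypothetical helper, and the record is written with explicit tab characters because VCF columns are tab-delimited.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values and flag phasing."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing fields may be omitted, as in '0/0:41:3' (no HQ value);
    # zip() simply stops at the shorter list.
    record = dict(zip(keys, values))
    gt = record.get("GT", "")
    record["phased"] = "|" in gt
    record["alleles"] = gt.replace("|", "/").split("/") if gt else []
    return record


# First data record of the example file, tab-delimited.
line = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)
fields = line.split("\t")
# Column 9 (index 8) is FORMAT; the sample columns follow it.
samples = [parse_sample(fields[8], s) for s in fields[9:]]

for name, sample in zip(["NA00001", "NA00002", "NA00003"], samples):
    print(name, sample["alleles"], "phased" if sample["phased"] else "unphased")
```

Here the first two samples are reported as phased and the third as unphased, and the third sample's HQ value is the missing-value placeholder ".,.".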