This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)
For a complete, detailed argument reference, refer to the GATK document page here.
CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.
The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.
The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.
set=Intersection : occurred in both call sets, not filtered out
set=NAME : occurred in the call set NAME only
set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2
set=filteredInAll : occurred in both call sets, but was filtered out of both
For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.
You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.
Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO
An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.
Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).
In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == ";Intersection";' -o intersect.vcf
This results in two vcf files, which look like:
==> union.vcf <==
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 990882 SNP1-980745 C T . PASS CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1 990984 SNP1-980847 G A . PASS CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1 992265 SNP1-982128 C T . PASS CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1 992819 SNP1-982682 G A . id50 CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1 993987 SNP1-983850 T C . PASS CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1 994391 rs2488991 G T . PASS AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1 996184 SNP1-986047 G A . PASS CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1 999649 SNP1-989512 G A . PASS CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni
==> intersect.vcf <==
1 950243 SNP1-940106 A C . PASS AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1 957640 rs6657048 C T . PASS AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1 959842 rs2710888 C T . PASS AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1 977780 rs2710875 C T . PASS AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1 985900 SNP1-975763 C T . PASS AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1 987200 SNP1-977063 C T . PASS AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1 987670 SNP1-977533 T G . PASS AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1 990417 rs2465136 T C . PASS AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection