This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)
For a complete, detailed argument reference, refer to the GATK document page here.
CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.
The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.
The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.
set=Intersection : occurred in both call sets, not filtered out
set=NAME : occurred in the call set NAME only
set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2
set=filteredInAll : occurred in both call sets, but was filtered out of both
For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.
You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.
Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO
An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.
Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).
In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == ";Intersection";' -o intersect.vcf
This results in two vcf files, which look like:
==> union.vcf <==
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 990882 SNP1-980745 C T . PASS CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1 990984 SNP1-980847 G A . PASS CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1 992265 SNP1-982128 C T . PASS CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1 992819 SNP1-982682 G A . id50 CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1 993987 SNP1-983850 T C . PASS CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1 994391 rs2488991 G T . PASS AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1 996184 SNP1-986047 G A . PASS CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1 999649 SNP1-989512 G A . PASS CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni
==> intersect.vcf <==
1 950243 SNP1-940106 A C . PASS AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1 957640 rs6657048 C T . PASS AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1 959842 rs2710888 C T . PASS AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1 977780 rs2710875 C T . PASS AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1 985900 SNP1-975763 C T . PASS AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1 987200 SNP1-977063 C T . PASS AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1 987670 SNP1-977533 T G . PASS AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1 990417 rs2465136 T C . PASS AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Hello,
I called variants in chunks (100000 bp chunks) on chromosome 20 on 450 reduced BAMS using -T UnifiedGenotyper, which resulted in the generation of 631 VCFs.
Now I want to combine these 631 VCFs using -T CombineVariants, however there seems to be a problem with some of the VCFs and all I get is an error message.
I can combine the 5 first VCFs that were generated, but when I add a sixth one (chr20:400000_500000) I get the following error:
##### ERROR MESSAGE: For input string: "20"
If I exclude that file, I can combine about 30 VCFs until I encounter a similar error message for a different VCF file:
##### ERROR MESSAGE: For input string: "120"
I've looked through both of the "problematic" VCFs and can't see how they differ from the ones that I can combine.
Any idea of what may be going wrong and how I can solve this? At the moment I can only identify problematic VCF by trying to combine them using CombineVariants since I don't know what it is about these particular VCF files that is causing the problem.
Thanks for your help, Tota
Hi,
I used Beagle to phase my data but for some indels, I have some probleme :
example :
Input vcf :
2 68599872 . ATG A 14.40 PASS AC=1;AC1=1;AF=0.028
Input for beagle created by ProduceBeagleInput:
2:68599872 TG - 1.0000 0.0000 0.0000 ......
Output vcf created by BeagleOutputToVCF:
2 68599872 . ATG . 14.40 BGL_RM_WAS_- AC1=1;AF1=0.02965.....
error message by CombineVariants:
MESSAGE: Badly formed variant context at location 68599872 in contig 2. Reference length must be at most one base shorter than location size
Can you help me?
Tipahine
When using CombineVariants, my variants get a FILTER value of either PASS or LowQual. Would it be possible to add an option to CombineVariants which prevents the FILTER value to be set to PASS? Otherwise I have to do some file processing before I run ApplyRecalibration further downstream. It would be great if this was a feature of all walkers and not just VariantFiltration. I'm not sure if the forum is the right place for feature requests. Happy to use Bugzilla or similar instead. Thanks.
Hi,
I'm just reverse engineering a colleagues script and I've noticed they're using CombineVariants in PRIORITIZE mode but without a -priority argument. I've looked at the documentation and I can't see what the defined behaviour would be in this situation. Would default priority in this situation follow the order of the arguments supplied; the reverse order; or random?
Thanks, Martin
Edit: Nevermind, from what I can see from the source it should be erroring out if -priority is not supplied. I must have missed something in the pipeline script.
Edit 2: No wait
if ( genotypeMergeOption == VariantContextUtils.GenotypeMergeType.PRIORITIZE && PRIORITY_STRING == null )
throw new UserException.MissingArgument("rod_priority_list", "Priority string must be provided if you want to prioritize genotypes");
is pointless because this is run first in initialize:
if ( PRIORITY_STRING == null ) {
PRIORITY_STRING = Utils.join(",", vcfRods.keySet());
logger.info("Priority string not provided, using arbitrary genotyping order: " + PRIORITY_STRING);
}
This should follow the input order yes? Unless vcfRods.keySet is sorted?
When I run CombineVariants on two vcf files with variants at the same position but with different REF alleles and also different sets of ALT alleles, the AD fields for the genotypes are not updated to reflect the changes in the ALT field. The REF and ALT fields and the GT field for each genotype are all correctly updated. For example combining
3 10128965 rs71052293 CTT CT,C,CTTT 19936.43 PASS AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL 0/2:115,0,33,12:230:6.96:980,1237,2795,0,946,1900,7,679,467,817 3/1:97,13,20,16:229:99:804,221,832,581,176,3047,521,0,1653,1595
and
3 10128965 rs71052293 CT C,CTT,CTTT 14280.61 PASS AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL 2/1:110,20,33,18:237:1.90:850,289,1027,457,0,1487,147,877,2,1858 0/3:80,48,5,29:209:99:1835,875,977,2101,1119,3322,0,142,331,462
gives
3 10128965 rs71052293 CTT CT,C,CTTT,CTTTT 19936.43 PASS AC=2,1,2,1;AF=0.250,0.125,0.250,0.125;AN=8;set=Intersection GT:AD:DP:GQ 0/2:115,0,33,12:230:7 3/1:97,13,20,16:229:99 3/1:110,20,33,18:237:2 0/4:80,48,5,29:209:99
There five alleles (one REF and four ALT) but only four AD fields for each genotype.
My command line:
java -jar -Xmx4g GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V test_input1.vcf -V test_input2.vcf -o test_combined.vcf
Is this a known limitation or a bug?
Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:
ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NumberFormatException: For input string: "."
I run this command:
java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY
Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?
Thanks in advance.
I am trying to merge two vcfs (SNVs and INDELs) from the same sample. The problem appears to be that the INDEL vcf defines "combined_sample_name" but the SNV vcf does not. So when I merge I get two sample columns. How can I force GATK to treat them as a single sample?
I tried --assumeIdenticalSamples to do a "simple merge," but that made no difference.
As a side note, these vcfs are from an Ion Torrent machine.
Thanks!
java -jar $GATK \
-R $TSP_FILEPATH_GENOME_FASTA \
-T CombineVariants \
--assumeIdenticalSamples \
-V:${baseFolderName} ${i}/SNP_variants.vcf \
-V:${baseFolderName} ${i}/indel_variants.vcf \
-o ${RESULTS_DIR}/${baseFolderName}_variants.vcf
Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing woked. Actually I found several discussions about the very same error message I receive, but it seems that no clear answere was provided. Here is this message:
I have tried three different MS Dos commands to do the task (see belbow), but the message didn't change:
1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf
2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions UNIQUIFY
3.java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions PRIORITIZE -priority foo,bar
I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.
Thank you very much for your help. S.
I have a set of VCFs with identical positions in them:
VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP
VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP
VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP
VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP
If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?
I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.