Tagged with #combinevariants 2 documentation articles | 3 announcements | 43 forum discussions

A new tool has been released!

Check out the documentation at CombineVariants.

This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)

For a complete, detailed argument reference, refer to the GATK document page here.

2. Logic for merging records across VCFs

CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

3. Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.

4. Understanding the set attribute

The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

• set=Intersection : occurred in both call sets, not filtered out

• set=NAME : occurred in the call set NAME only

• set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

• set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

5. Changing the set key

You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.

6. Minimal VCF output

Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO

An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

7. Combining Variant Calls with a minimum set of input sites

Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).

8. Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf


This results in two vcf files, which look like:

==> union.vcf <==
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

==> intersect.vcf <==
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection


GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

Unified Genotyper

• Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

• Improved the indexing scheme for gVCF outputs using the reference calculation model.
• The reference calculation model now works with reduced reads.
• Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
• Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

• Disable tranche plots in INDEL mode.
• Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 gaussians by default for the negative model; numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

• Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

• Added calculation for GC content.
• Added an option to filter the bases based on their quality scores.

Combine Variants

• Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

• Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

• SplitSamFile now produces an index with the BAM.
• Length metric updates to QualifyMissingIntervals.
• Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
• Picard jar updated to version 1.104.1628.
• Tribble jar updated to version 1.104.1628.
• Variant jar updated to version 1.104.1628.

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

• Improved the algorithm around homopolymer runs to use a "delocalized context".
• Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
• Fixed bug where the tool failed for reads that begin with insertions.
• Fixed bug in the scatter-gather functionality.
• Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• The genotyper no longer emits the Stand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
• Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
• Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
• Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
• Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
• The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
• Fixed annotations (particularly AD) for indel calls; previous versions didn't accurately bin reads into the reference or alternate sets correctly.
• Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

• Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

• Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
• Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
• Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

• Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

• AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Now outputs the first non-missing QUAL, not the maximum.

Select Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Removed the -number argument because it gave biased results.

Validate Variants

• Added option to selectively choose particular strict validation options.
• Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
• improved the error message around unused ALT alleles.

Somatic Indel Detector

• Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

• Fixed raw HapMap file conversion bug in VariantsToVCF.
• Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
• Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
• Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
• Fixed bug in BCF2 writer for case where all genotypes are missing.
• Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
• Fixed bug in Phase By Transmission when there are no likelihoods present.
• Fixed bug in fasta .fai generation.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Base Quality Score Recalibration

• Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
• Implemented support for SOLiD no call strategies other than throwing an exception.
• Fixed smoothing in the BQSR bins.
• Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

• Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
• UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
• Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
• In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
• Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

• Added LowQual filter to the output when appropriate.
• Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
• Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
• Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
• Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
• Fixed bug where non-standard bases from the reference would cause errors.
• Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

• Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
• Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
• Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

• Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
• Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

• Now outputs the first non-MISSING QUAL, instead of the maximum.
• Now supports multi-threaded running (with the -nt argument).

Select Variants

• Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
• No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
• If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

Miscellaneous

• GATK now generates a proper error when a gzipped FASTA is passed in.
• Various improvements throughout the BCF2-related code.
• Removed various parallelism bottlenecks in the GATK.
• Added support of X and = CIGAR operators to the GATK.
• Catch NumberFormatExceptions when parsing the VCF POS field.
• Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
• Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
• We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
• Added support for handling complex events in ValidateVariants.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Currently I am following GATK best practice for using HC 3.0+, however I'm splitting my calls to chromosomal regions (-L). Next are the following step I perform working up to GenotypeGVCF and my question.

1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
2 - I use CombineGVCF to merge 2 .. n number of individuals together. This is done because some analysis have 300+ individuals. 3- I then use CombineGVCF again to merge all the file from step 2 into one large gvcf file for one large joint GenotypeGVCF step. 4 - GenotypeGVCF is done again based on chromosomal regions (-L), which is followed by a additional CatVariants before VQSR.

The question I have this this: Given the size of the analysis I have noticed that my CombineGVCF done in step 3 can take anywhere from 4-8 hours. I was wondering if I could change this step to use CombineVariants and have the result be the same (unlost data). The main reason for this would be because GATK currently allow CombineVariants to use the -nt option.

Thanks for you time and work.

--Shawn

Is there a page, where I can see an example of what CombineVariants does with and without --printComplexMerges? I'm wondering whether to use bcftools merge or GATK CombineVariants.

Dear GATK team,

I am running the following command:

java -jar GenomeAnalysisTK_v33.jar -R genome.fa -T CombineVariants --variant file1.vcf --variant file2.vcf --variant file3.vcf --variant file4.vcf --variant file5.vcf --variant file6.vcf --variant file7.vcf --variant file8.vcf --variant file9.vcf --variant file10.vcf --minimumN 10 -o combined.vcf -genotypeMergeOptions UNIQUIFY

The output vcf file has the header for the sample columns:

sample.variant sample.variant10 sample.variant2 sample.variant3 sample.variant4 sample.variant5 sample.variant6 sample.variant7 sample.variant8 sample.variant9

What do these numbers represent? For example, does sample.variant10 equal to the 10th VCF file (it would be file10.vcf in this case) and so on?

Thank you!

Joe

Hey GATK Team!

I'm attempting to merge identical sample sets called on different (disjoint) chromosomes with the CombineVariants tool with the --assumeIdenticalSamples flag enabled, but the v3.3-0 tool is behaving as if this option is not enabled and it requests I specify a --genotypemergeoption. When I execute the same exact command using v3.2-2, the tool runs to completion without error.

Enabling --genotypemergeoption UNIQIFY with v3.3-0 just to see what happens (while --assumeIdenticalSamples is still enabled) outputs each sample 18 times (once for each chromosome) without any loss in the number of loci (sum of loci in individual files equals sum in combined output file).

I believe this may be a bug, but I could be wrong...

INFO  13:10:49,188 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:10:49,190 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO  13:10:49,191 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  13:10:49,194 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples
INFO  13:10:49,198 HelpFormatter - Executing as bredeson@gpint104.nersc.gov on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO  13:10:49,198 HelpFormatter - Date/Time: 2014/12/28 13:10:49
INFO  13:10:49,199 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:10:49,199 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:10:49,731 GenomeAnalysisEngine - Strictness is SILENT
INFO  13:10:50,015 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  13:10:50,731 GenomeAnalysisEngine - Preparing for traversal
INFO  13:10:50,751 GenomeAnalysisEngine - Done preparing for traversal
INFO  13:10:50,752 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  13:10:50,752 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  13:10:50,752 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  13:10:51,578 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Duplicate sample names were discovered but no genotypemergeoption was supplied. To combine samples without merging specify --genotypemergeoption UNIQUIFY. Merging duplicate samples without specified priority is unsupported, but can be achieved by specifying --genotypemergeoption UNSORTED.
##### ERROR ------------------------------------------------------------------------------------------

INFO  13:02:06,270 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:02:06,272 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
INFO  13:02:06,272 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  13:02:06,275 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples
INFO  13:02:06,279 HelpFormatter - Executing as bredeson@gpint104.nersc.gov on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO  13:02:06,280 HelpFormatter - Date/Time: 2014/12/28 13:02:06
INFO  13:02:06,280 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:02:06,280 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:02:06,812 GenomeAnalysisEngine - Strictness is SILENT
INFO  13:02:07,098 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  13:02:07,818 GenomeAnalysisEngine - Preparing for traversal
INFO  13:02:07,837 GenomeAnalysisEngine - Done preparing for traversal
INFO  13:02:07,838 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  13:02:07,838 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  13:02:07,838 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  13:02:07,840 CombineVariants - Priority string is not provided, using arbitrary genotyping order: null
INFO  13:02:10,798 ProgressMeter -            done      2967.0     2.0 s      16.6 m       88.9%     2.0 s       0.0 s
INFO  13:02:10,799 ProgressMeter - Total runtime 2.96 secs, 0.05 min, 0.00 hours
INFO  13:02:11,589 GATKRunReport - Uploaded run statistics report to AWS S3


Hi GATK team,

I am attempting to combine a HaplotypeCaller generated VCF with some indels called using pindel using the following arguments (GATK v3.3-0-g37228af):

-R /data/shared/ref/b37/human_g1k_v37.fasta -T CombineVariants --variant:GATK var.HiSeqDecember.raw.vcf --variant:pindel pindel_combined.vcf -o var.HiSeqDecember.pindel.raw.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,pindel

However I get the following error:

ERROR MESSAGE: Badly formed variant context at location 1:157718231; getEnd() was 157718235 but this VariantContext contains an END key with value 157718231

The variants in question are (from GATK):

1 157718231 . CAAAT C,CAAATAAAT 2533.56 PASS AC=3,13;AF=0.125,0.542;AN=24;BaseQRankSum=1.762;ClippingRankSum=-0.327;DP=126;FS=0.000;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;InbreedingCoeff=-0.1260;MLEAC=3,13;MLEAF=0.125,0.542;MQ=70.00;MQ0=0;MQRankSum=0.920;QD=22.22;ReadPosRankSum=-0.893;SOR=0.976;SVLEN=4;SVTYPE=INS;set=Intersection GT:DP:GQ 0/0:10:30 0/2:9:18 2/2:6:18 2/2:5:15 0/1:10:99 0/2:14:99 2/2:8:24 2/2:6:18 2/2:7:21 0/2:17:99 0/1:5:75 0/1:6:27

and (from pindel):

1 157718231 . C CAAAT . PASS AC=2;AF=0.143;AN=14;END=157718231;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;SVLEN=4;SVTYPE=INS;set=variant3-variant4-variant6-variant7-variant8-variant9-variant10 GT:AD ./. ./. 0/0:0,7 0/0:0,6 ./. 0/0:0,9 0/0:0,8 0/0:0,7 0/0:0,8 1/1:0,12 ./. ./.

It is worth noting that the pindel VCF here was merged together from several pindel-generated VCFs using CombineVariants without any complaint from the GATK. It looks to me that the END key is correct for the pindel variant (a simple insertion), but the GATK is confused due to the mixed deletion/insertion variant generated by the HaplotypeCaller at the same position (without an END key).

I can rerun the command after stripping all END tags from the pindel VCF and the command completes successfully, so this is not a showstopper for me but I assume this is a bug(?) and if so, it would be great if there were a fix.

Cheers,

Dave

Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?

Hello everyone, I recenltly ran -T CombineVariants on a set of samples and I noticed that the allele frequency (AF) in the info field changed. In the individual VCF, AF=0.3 for the particular variant I was looking at, however, after I combined multiple VCFs, AF=0.5.... however the variant was only seen in the one sample, so how could the frequency change? Any insight is appreciated.

Thank you,

Ricky

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

Thanks, Paweł

Hi GATK team,

I was working on generating a combined VCF using 150+ VCFs (building the sort of cohort). The purpose of it is to calculate variants cohort frequency. But I found the genotype is messed up in the combined VCF. Here is my cmd line:

java -jar /GATK/GenomeAnalysisTK-2.7-4/GenomeAnalysisTK.jar -R refernce.fasta \

-T CombineVariants \

--variant sample1.vcf \

--variant sample2.vcf \

-o combined.vcf

Here is one example of one variant/ position in the combined VCF. The record is very long in the combined VCF, I just grabbed the related columns here.

1 22082967 rs35545280 CAAA CAA,C,CA,CAAAA 76.73 PASS AC=109,6,104,6;AF=0.368,0.020,0.351,0.020;AN=296;DB;DP=9869;GC=48.13;MQ0=0;PercentNBaseSolid=0.0000;RU=A;STR;set=filterInvariant-filterInvariant2-filterInvariant3… GT:DP:GQ

In this record, sample 1 has this variant and it shows as "0/3:46:99",

but in the sample1.vcf, it is listed as

And you can see that the genotype in combined VCF for sample 1 is 0/3, but in its original its is 1/1 which is homozygous. So when I calculate the cohort frequency, I'm confused on matching genotype of this variant for sample 1.

To give you more idea, I listed another sample of same variant in combined.VCF and its record in sample2.VCF.

In the combined.VCF, sample 2 shows as "0/3:91:99".

In the sample2.VCF, the record is:

Where you can see the genotype is 1/2, but in the combined VCF, it shows as "0/3".

Please advise me if I should use any parameter in the cmd line to solve this problem.

Thank you. Linda

Hi all, I have a minor concern.

After using CombineVariants tools to combine variants from GATK and Samtools, MQ annotation is missing for some of the variants specifically for variants which are present in both GATK and Samtools. Below is an example:

$grep "16203159" VCF/GatkSNP.vcf chr8 16203159 . G T 4609.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.105;ClippingRankSum=0.652;DP=333; FS=1.794;MLEAC=1;MLEAF=0.500;MQ=57.94;MQ0=0;MQRankSum=1.205;QD=13.84;ReadPosRankSum=0.157 GT:AD:DP:GQ:PL 0/1:163,170:333:99:4638,0,4651$ grep "16203159" VCF/SamSNP.vcf
chr8    16203159    .   G   T   225 .   DP=346;VDB=4.969420e-01;RPB=-1.470555e+00;AF1=0.5;AC1=1;DP4=82,84,81,92;
MQ=58;FQ=225;PV4=0.66,6.2e-09,1,1   GT:PL:GQ    0/1:255,0,255:99


MQ annotation is present in both the VCF files. But after combining MQ is missing as shown below:

$grep "16203159" TMP_files/SNP_INDEL.vcf chr8 16203159 . G T 4609.77 . AC=1;AC1=1;AF=0.500;AF1=0.5;AN=2;BaseQRankSum=-7.105; ClippingRankSum=0.652;DP=679;DP4=82,84,81,92;FQ=225;FS=1.794;MLEAC=1;MLEAF=0.500;MQ0=0;MQRankSum=1.205; PV4=0.66,6.2e-09,1,1;QD=13.84;RPB=-1.470555e+00;ReadPosRankSum=0.157;VDB=4.969420e-01;set=GatkSNP-SamSNP GT:AD:DP:GQ:PL 0/1:163,170:333:99:4638,0,4651  Variants which are present in only of the VCF files do have MQ annotation after using CombineVariants. This happens only for variants which are present in both VCF files. Could you help to get MQ annotations in this case? Here is the command used: java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fa --variant:GatkSNP GatkSNP.vcf --variant:GatkINDEl GatkINDEL.vcf --variant:SamSNP SamSNP.vcf --variant:SamINDEL SamINDEL.vcf -o SNP_INDEL.vcf -genotypeMergeOptions PRIORITIZE -priority GatkSNP,GatkINDEL,SamSNP,SamINDEL --filteredrecordsmergetype KEEP_UNCONDITIONAL  Hello everyone, I have used the ion proton system to sequence 15 sample exomes so far. In this system, only 3 samples can be run at a given time, so I essentially have five sets of samples. Is it still possible to use CombineVariants on the VCF files of samples that were from different sequencing runs? Thank you, Ricky Hi, I have used CombineVariants to combine variants from GATK and samtools as shown below: java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fa --variant:GatkSNP GATKsnp.vcf --variant:GatkINDEL GATKind.vcf --variant:SamSNP Samsnp.vcf --variant:SamINDEL Samind.vcf -o allvar.vcf -genotypeMergeOptions PRIORITIZE -priority GatkSNP,GatkINDEL,SamSNP,SamINDEL --filteredrecordsmergetype KEEP_UNCONDITIONAL This merges all the variants. However, with the above command, i do get the variants present in both GATK and samtools emitted from samtools. I would like to get all the variants such that: • variants present in both GATK and samtools emitted from GATK vcf files • variants in only GATK • variants in only samtools could someone suggest any ideas or of there is something to be fixed in the command. Thanks I am running CombineVariants followed by VariantEval on two VCF files (GATK version 3.1-1). I noticed that none of my "set=FilteredInAll" variants are included in the output of VariantEval. I suspect this is because the FILTER column is not "PASS" because the variants were not "PASS" in either of the input VCF files. Is there an argument to pass into VariantEval so that it does not filter variants out based on the FILTER column? I was using the arguments from a guide article (http://gatkforums.broadinstitute.org/discussion/48/using-varianteval). I saw that in the comments you mentioned an --ignoreFilters argument, but when I tried to use it I got the GATK error: Argument with name 'ignoreFilters' isn't defined. I can just change the FILTER column entries in the combined VCF file, but it seems like a problem that others might have as well. Thank you! Hi I have two vcf files, each with the same chr:pos, but alleles are different:$ tail -1 test1.snp.vcf chr17 41267926 . G A 50.77 . xxx GT:AD:DP:GQ:P 0/1:2,3:6:49:79,0,49 $tail -1 test2.snp.vcf chr17 41267926 . G C 15.88 LowQual xxx GT:AD:DP:GQ:P 0/1:3,2:6:44:44,0,85 After running CombineVariants below$ java -jar GenomeAnalysisTK.jar -T CombineVariants -R ucsc.hg19.fasta -V test1.snp.vcf -V test2.snp.vcf -o ThreeAlleles.snp.vcf -genotypeMergeOptions UNIQUIFY

The ThreeAlleles.snp.vcf only have GT:DP:GQ instead of GT:AD:DP:GQ:PL in the FORMAT field

$tail -1 ThreeAlleles.snp.vcf chr17 41267926 . G A,C 50.77 PASS xxx GT:DP:GQ 0/1:6:49 0/2:6:44 How can I retain the GT:AD:DP:GQ:PL format for tri-allelic positions? Thanks, I was wondering if there is a method for filtering individual genotype calls when using CombineVariants to merge single-called VCF files. The desired behavior that I would like would be a hybrid between the KEEP_IF_ANY_UNFILTERED and KEEP_IF_ALL_UNFILTERED arguments to the -filteredRecordsMergeType. By this, I mean that any site that is unfiltered in any input will remain unfiltered in the output, but for any genotype call from a filtered input should have a filter annotation in the "FT" field of the genotype. I will show a simplified example below (extraneous columns removed from the sample files): Input 1: #CHROM POS ID (...) FILTER FORMAT SAMPLE1 1 11916764 rs79387574 (...) PASS GT:DP 0/0:45  Input 2: #CHROM POS ID (...) FILTER FORMAT SAMPLE2 1 11916764 rs79387574 (...) LowQ GT:DP 0/1:3  Desired Output: #CHROM POS ID (...) FILTER FORMAT SAMPLE1 SAMPLE2 1 11916764 rs79387574 (...) PASS GT:DP:FT 0/0:45:PASS 0/1:3:LowQ  The reason for requesting this is there is occasionally a single sample that may have had a bad call at a site. Using the "KEEP_IF_ALL_UNFILTERED" filters N-1 high quality calls. However, on the other extreme, if we use "KEEP_IF_ANY_UNFILTERED" and only a single sample passes the filters, we introduce N-1 low quality calls and assert that they pass our requisite filters. The requested hybrid method will keep all information from the input samples and allow for better granularity. John Wallace I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column. For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 .............. The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable. It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields. Hi, When using HaplotypeCaller's '--emitRefConfidence', with both GVCF and BP_RESOLUTION, we get lines like this one: 1 2337032 rs1129171 C T,<NON_REF> 480.77 . BaseQRankSum=0.218;ClippingRankSum=0.103;DB;DP=45;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.344;ReadPosRankSum=1.046 GT:AD:DP:GQ:PL:SB 0/1:19,26,0:45:99:509,0,330,565,407,97 2:9,10,13,13 The new 3.0 tool GenotypeGVCFs has no problem processing these lines. However CombineVariants fails to run with this message: ##### ERROR MESSAGE: Reference records should not have more than one alternate allele By looking at the relevant code: public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java if( vc.hasAllele(NON_REF_SYMBOLIC_ALLELE) ) { if( vc.getAlternateAlleles().size() > 1 ) { throw new IllegalStateException("Reference records should not have more than one alternate allele"); } Sounds like the presence of <NON_REF> in lines with ALTs is triggering this exception. I also looked at this forum entry and the example there doesn't have <NON_REF> in lines with ALTs. "Using the reference confidence calculation mode & generating a GVCF" Should I be able to run CombineVariants on files produced using '--emitRefConfidence'? Thanks, Carlos PS: I tried to include links to github and GATK forum, but the spam filter seems not to like that. Is there a way to pass a file containing a list of vcfs to CombineVariants --variant? I tried naming it with .list, but I get a tribble error. I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is: 1) Generate a VCF containing unique IDs (using GATK CallVariants) 2) Replace the VCF header with potentially non-unique IDs (using tabix -r) 3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants) It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3. The relevant lines are as follows: combined: HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127  combined.renamed: HG00421 HG00422 HG00422 HG00423 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127  combined.renamed.merged: HG00421 HG00422 HG00423 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127  Using the depth argument here, we can see that in the merged dataset, NA12801 has depths 290,293 whereas in the original and renamed datasets the depths were 127,127. The 290,293 depths correspond to HG00423, which is the column before. I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch. Thanks, John Wallace Hi team, I've been having some issues with DP following CombineVariants: I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup. When I merge the files using CombineVariants, I notice that the DP per each variant is actually equal to the sum of DP of each of the vcf files. For example: If for a shared variant in both vcf files the DP=8, then the DP in the union file will be DP=16. Neverhteless, if a given variant is not shared by both files, then the DP in the union file will be equal to the input file. Is there a way resolve this issue? Thanks! Best, Sagi Hi sir, I used GATK combine variants to call SNP from my 7 vcf files. Now I am using unified genotyper to call all samples simultaneously for SNP. When I focused on my gene of interest (DLX6), in combine variants output, there are 5 "./." among 7 samples. but in case of multisample SNP calling, DLX6 gave 0/0 in those samples and its corresponding information. So, I want to know what is the meaning of "./." in combine variants results and which results should I take for analysis? hello, I am using CombineVariants to combine two multisample vcfs from made from HaplotypeCaller. The vcfs cover the same genomic region, and do not overlap in sample names. I have noticed some potential issues reporting AD and PL fields after running CombineVariants. If the site was multiallelic in one vcf but biallelic in the other vcf, then the AD and PL fields are omitted altogether, such that GT:DP:GQ is what is reported in the combined file - I am guessing this is expected behavior? however, if the site was biallelic in both vcfs, or monomorphic on one vcf but polymorphic in the other, the AD and PL fields are reported, but the numbers do not make sense. my command line: GATK -T CombineVariants -R ~/Capsella_rubella_v1.0_combined.fasta -V sc1_HC_94samp.vcf -V sc1_HC_set2.vcf -nt 20 -o sc1_HC.vcf & original vcf record: scaffold_1 3275 . G T 1170.18 . AC=2;AF=0.011;AN=188;BaseQRankSum=0.295;ClippingRankSum=-1.233;DP=5815;FS=1.186;InbreedingCoeff=-0.0106;MLEAC=1;MLEAF=5.319e-03;MQ=98.00;MQ0=0;MQRankSum=3.414;QD=9.36;ReadPosRankSum=1.761 GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,293,4371 0/0:56,0:56:99:0,218,2973 0/0:40,0:40:99:0,161,2615 0/0:65,0:65:99:0,239,3898 0/0:62,0:62:99:0,227,3612 combined vcf record: scaffold_1 3275 . G T 1170.18 . AC=3;AF=7.979e-03;AN=376;DP=11787;MLEAC=1;MLEAF=5.319e-03;MQ0=0;set=Intersection GT:AD:DP:GQ:PL 0/0:0,0:59:99:0,92,0 0/0:0,42:56:99:3,5,2805 0/0:675,0:40:99:29,26,8482 0/0:45,0:65:99:43,6,11447 0/0:0,187:62:99:4079,0,0 0/0:55,0:27:99:0,182,941 0/0:0,251:69:99:0,162,2300 is this a bug? or am I missing something about how to use CombineVariants? thanks much for your help! YW I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL. Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file. #CHROM POS ID REF ALT 100403001-1 100403001-1A 100403001-1B 1 5933530 rs905469 A G 0/0:37,0:0,99,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 5933530 rs905469 A G 0/0:37,0:189,15,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 10412636 rs4846215 A T 0/0:119,0:0,358,4297 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 10412636 rs4846215 A T 0/0:119,0:110,9,0 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 11729035 rs79974326 G C 0/0:50,0:0,141,1709 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 11729035 rs79974326 G C 0/0:50,0:1930,0,3851 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 16735764 rs182873855 G A 0/0:54,0:0,138,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 16735764 rs182873855 G A 0/0:54,0:174,0,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 17316577 rs77880760 G T 0/0:42,0:0,123,1470 0/0:38,0:0,111,1317 0/0:53,0:0,153,1817 1 17316577 rs77880760 G T 0/0:42,0:233,17,1470 0/0:38,0:0,111,1317 0/0:225,25:0,153,181 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:0,87,1066 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:1844,159,0 1 31740706 rs3753373 A G 0/0:123,0:0,349,4173 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885 1 31740706 rs3753373 A G 0/0:123,0:117,6,0 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885  Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run. In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed. Thanks, Kevin Hi! This is my code: java -Xmx4g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar \ -R genome.fa \ --filter_reads_with_N_cigar \ -T CombineVariants \ -V:NORMALSpost RNA-edit_SNPs_NORMALS-post_SNPeff_VA.vfc \ -V:OMNI 1000G_omni2.5.hg19.vcf \ -V:db137 dbsnp_137.hg19.vcf \ -V:Hapmap hapmap_3.3.hg19.vcf \ -V:ESP1,VCF ESP6500SI-V2-SSA137.updatedRsIds.Allchr.snps_indels_FIX.vcf \ --out NORMALS-post_TruePositives_raw.vcf && java -Xmx4g -jar GenomeAnalysisTK.jar \ -R genome.fa \ --filter_reads_with_N_cigar \ -T SelectVariants \ -V NORMALS-post_TruePositives_raw.vcf \ -select "set == 'NORMALSpost'" \ --out NORMALS-post_TruePositives.vcf  I hope I got this right: I now have SNPs (I have only called SNPs, not indels) that is found in my sample, 1000G, dbSNP and ESP. Right? Question: Can I easily make a VCF with all SNPs in MY sample only, not found in anything else by tweaking this code? Or must I do it some way else? Thanks! Hi! Sorry for my ignorance, but the ESP files contains 24 vcd in total. I tried to include them all on CombineVariants, but gets an error. I have tried to merge the 24 file to one, both with vcftools and picard, failing with both.. Somebody knows? Thank you! If I want to merge different VCF files, which I used -L argument for calling variants against to different chromosomes individually with the same list of samples by HaplotypeCaller. I mean the sample are the same, I just used -L to call variants chromosome by chromosome separately. I suppose whether catVariants or CombineVariant will give me the same results, right ? Hi! The title says it all. Im trying to narrow down my vcf. Which tools is good to answer this question? I have tried this: java -Xmx4g -jar GenomeAnalysisTK.jar \ -R genome.fa \ --filter_reads_with_N_cigar \ -T CombineVariants \ -V:COHORT1 path \ -V:COHORT2 path \ --filteredAreUncalled \ -o path But the output seams incomprehensibly massive! Thanks! is there any way to combine a snp vcf and indel vcf (generated with the UnifiedGenotyper) later? in the way that there is only one row per locus? regardless how I combine (I tried mainly CombineVariants), if there is something different called in the two vcf files in one locus, there are two rows in the combined one; I would like this called/written as alternatives for one locus Hi, I'm working with different tools to call variants (GATK is one of them) for the same sample and I'm merging these results with   gatk -T CombineVariants -R ucsc.hg19.fasta -V:GATK GATK.vcf -V:OTHER OTHER.vcf -o combined.vcf -genotypeMergeOptions PRIORITIZE --rod_priority_list GATK,OTHER   But when I have a discrepancy between results (I mean, same chr and pos, but different call), I just get GATK's result in my result file, Is there any way to have both calls when I have discrepancies? I don't want to use -genotypeMergeOptions UNIQUIFY. EXAMPLE:   GATK.vcf: chr17 7917905 . C T 360.16 . [...] GT 0/1 chr17 7918012 . C T 896.16 . [...] GT 0/1     OTHER.vcf: chr17 7917905 . C T 360.16 . [...] GT 0/1 chr17 7918012 . C T 896.16 . [...] GT 1/1     combined.vcf: chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1 chr17 7918012 . C T 896.16 . [...];set=Intersection GT 0/1   What I want to obtain:   combined.vcf: chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1 chr17 7918012 . C T 896.16 . [...];set=GATK GT 0/1 chr17 7918012 . C T 896.16 . [...];set=OTHER GT 1/1   Hi, I'm calling Variants with HaplotypeCaller in a population of 2 Parents and 7 F1-individuals. After read backed phasing I'm combining the vcf files of my genotypes with CombineVariants. In the outfile I very often find "./.". I thought this means there is no coverage at a certain position. But at many positions I do have good coverage. Why do I then get ./.? Moreover I used FastaAlternateReferenceMaker and created a new reference sequence including the variants from the parents. In that case, after I run HC and do the phasing and combine variants steps, I only get "./." at positions where there is really no coverage (as I can see in my mappings). Nadia HI my two VCF files have different alternate allele at the same position as i have called the variants using two different callers. When i run combine Variants on the both my GT field is not updated properly. AD field is also not updated properly but i ran Variant Annotator and that fixes that issue. VCF 1: chr1 87708015 rs58006838 C T 145.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.479;DB;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=3.9299;MLEAC=1;MLEAF=0.500;MQ=68.54;MQ0=0;MQRankSum=-1.109;QD=10.41;ReadPosRankSum=0.925 GT:AD:DP:GQ:PL 0/1:3,9:14:37:174,0,37  VCF 2: chr1 87708015 . C A . PASS AN=2;DP=4;NS=1 GT:AD:DP:GQ 0/1:2,2:4:8.42  Merged VCF file: chr1 87708015 rs58006838 C T,A 145.77 PASS AC=1,0;AF=0.500,0.00;AN=2;BaseQRankSum=-1.479;DB;DP=18;Dels=0.00;FS=0.000;HaplotypeScore=3.9299;MLEAC=1;MLEAF=0.500;MQ=68.54;MQ0=0;MQRankSum=-1.109;NS=1;QD=10.41;ReadPosRankSum=0.925;set=Intersection GT:AD:DP:GQ **0/1**:3,9,2:14:37  command used: java -jar$gatk/2.7-1-g42d771f/GenomeAnalysisTK.jar -T CombineVariants -V one.vcf.gz -V two.vcf.gz -o test.vcf -R $ref  Is this a known limitation or a bug? Hello! So today I have to compare two different vcf (from different pipelines, only one of them is from GATK) and when I tried to combine it with combineVariants, I had an error because the references doesn't match: The provided variant file(s) have inconsistent references for the same position(s) at chr1:1724719, A* vs. G* I know why it's happening, but I don't know if there is an option/tools to "fix" the vcf from the other pipeline. That I want to do is take the VCF created with a "bad" reference and use a command or GATK option to "fix" it according with the correct reference fasta file. Thanks for your help! I've seen related issues discussed here but not exactly this one. I'm following closely the current recommendations for an exome pipeline, and the GATK version,downloaded from git, was v2.5-2-gf57256b, Compiled 2013/06/06 17:28:57. For example, I have two samples with heterozygous variants 12:81503433C>G. The AD values for the the samples in the raw vcf file, and the SNVs-only file were 15,14 and 20,15 for the two samples and these agree with what I see in IGV. There was nothing in the indels-only file at that position. The AD values were the same in the recalibrated SNVs-only file. But after combining the recalibrated SNVs and indels with CombineVariants the AD values inexplicably became 21,0 and 0,24 respectively. This seems to be happening to many variants. -T CombineVariants -genotypeMergeOptions UNIQUIFY Is UNIQUIFY the default? I want every variant in the output file with each individual in the same column When using CombineVariants, my variants get a FILTER value of either PASS or LowQual. Would it be possible to add an option to CombineVariants which prevents the FILTER value to be set to PASS? Otherwise I have to do some file processing before I run ApplyRecalibration further downstream. It would be great if this was a feature of all walkers and not just VariantFiltration. I'm not sure if the forum is the right place for feature requests. Happy to use Bugzilla or similar instead. Thanks. I have a set of VCFs with identical positions in them: VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator? I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you. Hello, I called variants in chunks (100000 bp chunks) on chromosome 20 on 450 reduced BAMS using -T UnifiedGenotyper, which resulted in the generation of 631 VCFs. Now I want to combine these 631 VCFs using -T CombineVariants, however there seems to be a problem with some of the VCFs and all I get is an error message. I can combine the 5 first VCFs that were generated, but when I add a sixth one (chr20:400000_500000) I get the following error: ##### ERROR MESSAGE: For input string: "20" If I exclude that file, I can combine about 30 VCFs until I encounter a similar error message for a different VCF file: ##### ERROR MESSAGE: For input string: "120" I've looked through both of the "problematic" VCFs and can't see how they differ from the ones that I can combine. Any idea of what may be going wrong and how I can solve this? At the moment I can only identify problematic VCF by trying to combine them using CombineVariants since I don't know what it is about these particular VCF files that is causing the problem. Thanks for your help, Tota Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error: ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NumberFormatException: For input string: "."  I run this command: java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY  Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS? Thanks in advance. I am trying to merge two vcfs (SNVs and INDELs) from the same sample. The problem appears to be that the INDEL vcf defines "combined_sample_name" but the SNV vcf does not. So when I merge I get two sample columns. How can I force GATK to treat them as a single sample? I tried --assumeIdenticalSamples to do a "simple merge," but that made no difference. As a side note, these vcfs are from an Ion Torrent machine. Thanks! java -jar$GATK \
-R $TSP_FILEPATH_GENOME_FASTA \ -T CombineVariants \ --assumeIdenticalSamples \ -V:${baseFolderName} ${i}/SNP_variants.vcf \ -V:${baseFolderName} ${i}/indel_variants.vcf \ -o${RESULTS_DIR}/\${baseFolderName}_variants.vcf


Hi,

I'm just reverse engineering a colleagues script and I've noticed they're using CombineVariants in PRIORITIZE mode but without a -priority argument. I've looked at the documentation and I can't see what the defined behaviour would be in this situation. Would default priority in this situation follow the order of the arguments supplied; the reverse order; or random?

Thanks, Martin

Edit: Nevermind, from what I can see from the source it should be erroring out if -priority is not supplied. I must have missed something in the pipeline script.

Edit 2: No wait

    if ( genotypeMergeOption == VariantContextUtils.GenotypeMergeType.PRIORITIZE && PRIORITY_STRING == null )
throw new UserException.MissingArgument("rod_priority_list", "Priority string must be provided if you want to prioritize genotypes");


is pointless because this is run first in initialize:

    if ( PRIORITY_STRING == null ) {
PRIORITY_STRING = Utils.join(",", vcfRods.keySet());
logger.info("Priority string not provided, using arbitrary genotyping order: " + PRIORITY_STRING);
}


This should follow the input order yes? Unless vcfRods.keySet is sorted?

Hi,

I used Beagle to phase my data but for some indels, I have some probleme :

example :

Input vcf :

2       68599872        .       ATG     A       14.40   PASS    AC=1;AC1=1;AF=0.028


Input for beagle created by ProduceBeagleInput:

2:68599872 TG - 1.0000 0.0000 0.0000 ......


Output vcf created by BeagleOutputToVCF:

2       68599872        .       ATG     .       14.40   BGL_RM_WAS_-    AC1=1;AF1=0.02965.....


error message by CombineVariants:

MESSAGE: Badly formed variant context at location 68599872 in contig 2. Reference length must be at most one base shorter than location size


Can you help me?

Tipahine

Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing woked. Actually I found several discussions about the very same error message I receive, but it seems that no clear answere was provided. Here is this message:

ERROR ------------------------------------------------------------------------------------------

I have tried three different MS Dos commands to do the task (see belbow), but the message didn't change:

1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf

2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions UNIQUIFY

3.java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta  -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions PRIORITIZE  -priority foo,bar


I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.

Thank you very much for your help. S.

When I run CombineVariants on two vcf files with variants at the same position but with different REF alleles and also different sets of ALT alleles, the AD fields for the genotypes are not updated to reflect the changes in the ALT field. The REF and ALT fields and the GT field for each genotype are all correctly updated. For example combining

3   10128965    rs71052293  CTT CT,C,CTTT   19936.43    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  0/2:115,0,33,12:230:6.96:980,1237,2795,0,946,1900,7,679,467,817 3/1:97,13,20,16:229:99:804,221,832,581,176,3047,521,0,1653,1595

and

3   10128965    rs71052293  CT  C,CTT,CTTT  14280.61    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  2/1:110,20,33,18:237:1.90:850,289,1027,457,0,1487,147,877,2,1858    0/3:80,48,5,29:209:99:1835,875,977,2101,1119,3322,0,142,331,462

gives

3   10128965    rs71052293  CTT CT,C,CTTT,CTTTT 19936.43    PASS    AC=2,1,2,1;AF=0.250,0.125,0.250,0.125;AN=8;set=Intersection GT:AD:DP:GQ 0/2:115,0,33,12:230:7   3/1:97,13,20,16:229:99  3/1:110,20,33,18:237:2  0/4:80,48,5,29:209:99


There five alleles (one REF and four ALT) but only four AD fields for each genotype.

My command line:

java -jar -Xmx4g GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V test_input1.vcf -V test_input2.vcf -o test_combined.vcf


Is this a known limitation or a bug?