Tagged with #combinevariants
2 documentation articles | 3 announcements | 46 forum discussions


Comments (6)

A new tool has been released!

Check out the documentation at CombineVariants.

Comments (18)

1. About CombineVariants

This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)

For a complete, detailed argument reference, refer to the GATK document page here.

2. Logic for merging records across VCFs

CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

3. Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.

4. Understanding the set attribute

The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

  • set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

5. Changing the set key

You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.

6. Minimal VCF output

Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO

An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

7. Combining Variant Calls with a minimum set of input sites

Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).

8. Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

==> union.vcf <==
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

==> intersect.vcf <==
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.


Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disable tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 gaussians by default for the negative model; numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.
Comments (2)

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Stand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't accurately bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.
Comments (0)

Base Quality Score Recalibration

  • Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
  • Implemented support for SOLiD no call strategies other than throwing an exception.
  • Fixed smoothing in the BQSR bins.
  • Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

  • Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
  • UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
  • Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
  • In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
  • Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

  • Added LowQual filter to the output when appropriate.
  • Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
  • Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
  • Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
  • Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
  • Fixed bug where non-standard bases from the reference would cause errors.
  • Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

  • Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
  • Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
  • Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

  • Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
  • Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

  • Now outputs the first non-MISSING QUAL, instead of the maximum.
  • Now supports multi-threaded running (with the -nt argument).

Select Variants

  • Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
  • No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
  • If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

Miscellaneous

  • Updated and improved the BadCigar read filter.
  • GATK now generates a proper error when a gzipped FASTA is passed in.
  • Various improvements throughout the BCF2-related code.
  • Removed various parallelism bottlenecks in the GATK.
  • Added support of X and = CIGAR operators to the GATK.
  • Catch NumberFormatExceptions when parsing the VCF POS field.
  • Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
  • Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
  • We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
  • Added support for handling complex events in ValidateVariants.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.
Comments (3)

Dear all,

I try to use the CombineVariants to combine about 200 samples like this:

java -Xmx30g -jar GenomeAnalysisTK.jar -R N.tab.all.fa -T CombineVariants --variant S046.vcf --variant S237.vcf --variant S129.vcf --variant S126.vcf --variant S209.vcf --variant S193.vcf --variant S158.vcf --variant S174.vcf --variant S001.vcf.idx --variant S135.vcf --variant S222.vcf --variant S101.vcf --variant S009.vcf --variant S143.vcf --variant S232.vcf --variant S206.vcf --variant S207.vcf --variant S210.vcf --variant S039.vcf --variant S072.vcf --variant S231.vcf --variant S133.vcf --variant S137.vcf --variant S139.vcf --variant S081.vcf --variant S246.vcf --variant S224.vcf --variant S121.vcf --variant S055.vcf --variant S114.vcf --variant S175.vcf --variant S131.vcf --variant S238.vcf --variant S060.vcf --variant S090.vcf --variant S053.vcf --variant S177.vcf --variant S185.vcf --variant S202.vcf --variant S240.vcf --variant S025.vcf --variant S031.vcf --variant S250.vcf --variant S227.vcf --variant S223.vcf --variant S155.vcf --variant S235.vcf --variant S024.vcf --variant S014.vcf --variant S243.vcf --variant S056.vcf --variant S144.vcf --variant S136.vcf --variant S033.vcf --variant S005.vcf --variant S059.vcf --variant S047.vcf --variant S122.vcf --variant S084.vcf --variant S119.vcf --variant S216.vcf --variant S023.vcf --variant S028.vcf --variant S016.vcf --variant S219.vcf --variant S061.vcf --variant S088.vcf --variant S248.vcf --variant S196.vcf --variant S094.vcf --variant S186.vcf --variant S204.vcf --variant S075.vcf --variant S205.vcf --variant S079.vcf --variant S029.vcf --variant S161.vcf --variant S140.vcf --variant S096.vcf --variant S117.vcf --variant S118.vcf --variant S150.vcf --variant S097.vcf --variant S148.vcf --variant S153.vcf --variant S178.vcf --variant S040.vcf --variant S063.vcf --variant S099.vcf --variant S179.vcf --variant S073.vcf --variant S168.vcf --variant S154.vcf --variant S230.vcf --variant S213.vcf --variant S057.vcf --variant S098.vcf --variant S247.vcf --variant S244.vcf --variant S228.vcf --variant S242.vcf --variant S002.vcf.idx --variant S054.vcf --variant S008.vcf --variant S087.vcf --variant S199.vcf --variant S108.vcf --variant S019.vcf --variant S241.vcf --variant S006.vcf --variant S208.vcf --variant S212.vcf --variant S080.vcf --variant S187.vcf --variant S123.vcf --variant S043.vcf --variant S163.vcf --variant S124.vcf --variant S183.vcf --variant S049.vcf --variant S038.vcf --variant S149.vcf --variant S036.vcf --variant S211.vcf --variant S157.vcf --variant S142.vcf --variant S184.vcf --variant S078.vcf --variant S085.vcf --variant S050.vcf --variant S091.vcf --variant S176.vcf --variant S214.vcf --variant S215.vcf --variant S083.vcf --variant S062.vcf --variant S065.vcf --variant S032.vcf --variant S125.vcf --variant S156.vcf --variant S182.vcf --variant S086.vcf --variant S255.vcf --variant S254.vcf --variant S001.vcf --variant S162.vcf --variant S181.vcf --variant S192.vcf --variant S169.vcf --variant S074.vcf --variant S173.vcf --variant S042.vcf --variant S106.vcf --variant S003.vcf --variant S207_supp.vcf --variant S071.vcf --variant S2.vcf --variant S035.vcf --variant S037.vcf --variant S127.vcf --variant S165.vcf --variant S170.vcf --variant S027.vcf --variant S141.vcf --variant S252.vcf --variant S234.vcf --variant S191.vcf --variant S197.vcf --variant S147.vcf --variant S058.vcf --variant S172.vcf --variant S010.vcf --variant S138.vcf --variant S066.vcf --variant S041.vcf --variant S082.vcf --variant S171.vcf --variant S132.vcf --variant S015.vcf --variant S200.vcf --variant S069.vcf --variant S220.vcf --variant S218.vcf --variant S076.vcf --variant S134.vcf --variant S151.vcf --variant S130.vcf --variant S221.vcf --variant S120.vcf --variant S145.vcf --variant S011.vcf --variant S217.vcf --variant S159.vcf --variant S067.vcf --variant S021.vcf --variant S020.vcf --variant S104.vcf --variant S128.vcf --variant S253.vcf --variant S225.vcf --variant S107.vcf --variant S201.vcf --variant S034.vcf --variant S203.vcf --variant S116.vcf --variant S167.vcf --variant S115.vcf --variant S002.vcf --variant S003.vcf.idx --variant S064.vcf --variant S012.vcf --variant S111.vcf --variant S017.vcf --variant S189.vcf --variant S236.vcf --variant S070.vcf --variant S113.vcf --variant S007.vcf --variant S112.vcf --variant S018.vcf --variant S229.vcf --variant S245.vcf --variant S093.vcf --variant S251.vcf --variant S166.vcf --variant S164.vcf --variant S239.vcf --variant S030.vcf --variant S004.vcf --variant S095.vcf --variant S110.vcf --variant S188.vcf --variant S109.vcf --variant S013.vcf --variant S051.vcf --variant S100.vcf --variant S194.vcf --variant S160.vcf --variant S190.vcf --variant S105.vcf --variant S146.vcf --variant S198.vcf --variant S092.vcf --variant S044.vcf --variant S052.vcf --variant S026.vcf --variant S249.vcf --variant S180.vcf --variant S103.vcf --variant S022.vcf --variant S089.vcf --variant S102.vcf --variant S077.vcf --variant S068.vcf --variant S048.vcf --variant S195.vcf --variant S152.vcf --variant S045.vcf --variant S226.vcf -o output_SNP.vcf -genotypeMergeOptions UNIQUIFY

However, there always some error message:

INFO 05:05:38,893 HelpFormatter - -------------------------------------------------------------------------------- INFO 05:05:38,893 HelpFormatter - -------------------------------------------------------------------------------- INFO 05:05:39,871 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
ERROR Name FeatureType Documentation
ERROR BCF2 VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)
ERROR ------------------------------------------------------------------------------------------

Can anyone give me some suggestions?

Jingjing

Comments (1)

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record 

I made this call with HC from 10 samples:

20  106089  .   CA  C

And this call with UG from 10 other samples:

20  106089  .   C   A

CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C

bcftools merges like this:

20  106089  .   CA  AA,C

The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A

The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C

Is it possible to have CV merge like bcftools does it?

In another thread @Geraldine_VdAuwera said:

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142

HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0

UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0

This thread is related to the following threads on GGA:

http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode
http://gatkforums.broadinstitute.org/discussion/5018/ug-call-combined-snp-indel-sites-in-gga-mode
http://gatkforums.broadinstitute.org/discussion/4936/not-all-sites-emitted-with-genotype-given-alleles
http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong

P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!

Comments (3)

Hi all, I have 2 DNAseq batches of samples (more batches are ongoing sequencing, i.e. cohort). I used HaplotypeCaller per sample and got 2 batches of gvcfs, one gvcf per sample. But I am not sure which of the following order is best: (CombineGVCFs per batch) -> GenotypeGVCFs (obtain a single vcf) -> VariantRecalibration, or GenotypeGVCFs per batch -> VariantRecalibration per batch -> CombineVariant (obtain a single vcf). What is the difference between these two orders? and which one is more suitable for my situation? Hoping for any suggestions. Thank you very much ! Best, Hiu

Comments (6)

Currently I am following GATK best practice for using HC 3.0+, however I'm splitting my calls to chromosomal regions (-L). Next are the following step I perform working up to GenotypeGVCF and my question.

1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
2 - I use CombineGVCF to merge 2 .. n number of individuals together. This is done because some analysis have 300+ individuals. 3- I then use CombineGVCF again to merge all the file from step 2 into one large gvcf file for one large joint GenotypeGVCF step. 4 - GenotypeGVCF is done again based on chromosomal regions (-L), which is followed by a additional CatVariants before VQSR.

The question I have this this: Given the size of the analysis I have noticed that my CombineGVCF done in step 3 can take anywhere from 4-8 hours. I was wondering if I could change this step to use CombineVariants and have the result be the same (unlost data). The main reason for this would be because GATK currently allow CombineVariants to use the -nt option.

Thanks for you time and work.

--Shawn

Comments (3)

Is there a page, where I can see an example of what CombineVariants does with and without --printComplexMerges? I'm wondering whether to use bcftools merge or GATK CombineVariants.

Comments (3)

Dear GATK team,

I am running the following command:

java -jar GenomeAnalysisTK_v33.jar -R genome.fa -T CombineVariants --variant file1.vcf --variant file2.vcf --variant file3.vcf --variant file4.vcf --variant file5.vcf --variant file6.vcf --variant file7.vcf --variant file8.vcf --variant file9.vcf --variant file10.vcf --minimumN 10 -o combined.vcf -genotypeMergeOptions UNIQUIFY

The output vcf file has the header for the sample columns:

sample.variant sample.variant10 sample.variant2 sample.variant3 sample.variant4 sample.variant5 sample.variant6 sample.variant7 sample.variant8 sample.variant9

What do these numbers represent? For example, does sample.variant10 equal to the 10th VCF file (it would be file10.vcf in this case) and so on?

Thank you!

Joe

Comments (5)

Hey GATK Team!

I'm attempting to merge identical sample sets called on different (disjoint) chromosomes with the CombineVariants tool with the --assumeIdenticalSamples flag enabled, but the v3.3-0 tool is behaving as if this option is not enabled and it requests I specify a --genotypemergeoption. When I execute the same exact command using v3.2-2, the tool runs to completion without error.

Enabling --genotypemergeoption UNIQIFY with v3.3-0 just to see what happens (while --assumeIdenticalSamples is still enabled) outputs each sample 18 times (once for each chromosome) without any loss in the number of loci (sum of loci in individual files equals sum in combined output file).

I believe this may be a bug, but I could be wrong...

INFO  13:10:49,188 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,190 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  13:10:49,191 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  13:10:49,191 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  13:10:49,194 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples 
INFO  13:10:49,198 HelpFormatter - Executing as bredeson@gpint104.nersc.gov on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO  13:10:49,198 HelpFormatter - Date/Time: 2014/12/28 13:10:49 
INFO  13:10:49,199 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,199 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,731 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:10:50,015 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  13:10:50,731 GenomeAnalysisEngine - Preparing for traversal 
INFO  13:10:50,751 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:10:50,752 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:10:50,752 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:10:50,752 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  13:10:51,578 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Duplicate sample names were discovered but no genotypemergeoption was supplied. To combine samples without merging specify --genotypemergeoption UNIQUIFY. Merging duplicate samples without specified priority is unsupported, but can be achieved by specifying --genotypemergeoption UNSORTED.
##### ERROR ------------------------------------------------------------------------------------------


INFO  13:02:06,270 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,272 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 
INFO  13:02:06,272 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  13:02:06,272 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  13:02:06,275 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples 
INFO  13:02:06,279 HelpFormatter - Executing as bredeson@gpint104.nersc.gov on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO  13:02:06,280 HelpFormatter - Date/Time: 2014/12/28 13:02:06 
INFO  13:02:06,280 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,280 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,812 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:02:07,098 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  13:02:07,818 GenomeAnalysisEngine - Preparing for traversal 
INFO  13:02:07,837 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:02:07,838 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:02:07,838 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:02:07,838 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  13:02:07,840 CombineVariants - Priority string is not provided, using arbitrary genotyping order: null 
INFO  13:02:10,798 ProgressMeter -            done      2967.0     2.0 s      16.6 m       88.9%     2.0 s       0.0 s 
INFO  13:02:10,799 ProgressMeter - Total runtime 2.96 secs, 0.05 min, 0.00 hours 
INFO  13:02:11,589 GATKRunReport - Uploaded run statistics report to AWS S3
Comments (8)

Hi GATK team,

I am attempting to combine a HaplotypeCaller generated VCF with some indels called using pindel using the following arguments (GATK v3.3-0-g37228af):

-R /data/shared/ref/b37/human_g1k_v37.fasta -T CombineVariants --variant:GATK var.HiSeqDecember.raw.vcf --variant:pindel pindel_combined.vcf -o var.HiSeqDecember.pindel.raw.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,pindel

However I get the following error:

ERROR MESSAGE: Badly formed variant context at location 1:157718231; getEnd() was 157718235 but this VariantContext contains an END key with value 157718231

The variants in question are (from GATK):

1 157718231 . CAAAT C,CAAATAAAT 2533.56 PASS AC=3,13;AF=0.125,0.542;AN=24;BaseQRankSum=1.762;ClippingRankSum=-0.327;DP=126;FS=0.000;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;InbreedingCoeff=-0.1260;MLEAC=3,13;MLEAF=0.125,0.542;MQ=70.00;MQ0=0;MQRankSum=0.920;QD=22.22;ReadPosRankSum=-0.893;SOR=0.976;SVLEN=4;SVTYPE=INS;set=Intersection GT:DP:GQ 0/0:10:30 0/2:9:18 2/2:6:18 2/2:5:15 0/1:10:99 0/2:14:99 2/2:8:24 2/2:6:18 2/2:7:21 0/2:17:99 0/1:5:75 0/1:6:27

and (from pindel):

1 157718231 . C CAAAT . PASS AC=2;AF=0.143;AN=14;END=157718231;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;SVLEN=4;SVTYPE=INS;set=variant3-variant4-variant6-variant7-variant8-variant9-variant10 GT:AD ./. ./. 0/0:0,7 0/0:0,6 ./. 0/0:0,9 0/0:0,8 0/0:0,7 0/0:0,8 1/1:0,12 ./. ./.

It is worth noting that the pindel VCF here was merged together from several pindel-generated VCFs using CombineVariants without any complaint from the GATK. It looks to me that the END key is correct for the pindel variant (a simple insertion), but the GATK is confused due to the mixed deletion/insertion variant generated by the HaplotypeCaller at the same position (without an END key).

I can rerun the command after stripping all END tags from the pindel VCF and the command completes successfully, so this is not a showstopper for me but I assume this is a bug(?) and if so, it would be great if there were a fix.

Cheers,

Dave

Comments (6)

Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?

Comments (3)

Hello everyone, I recenltly ran -T CombineVariants on a set of samples and I noticed that the allele frequency (AF) in the info field changed. In the individual VCF, AF=0.3 for the particular variant I was looking at, however, after I combined multiple VCFs, AF=0.5.... however the variant was only seen in the one sample, so how could the frequency change? Any insight is appreciated.

Thank you,

Ricky

Comments (8)

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

[1] https://sites.google.com/site/gvcftools/home [2] https://sites.google.com/site/gvcftools/home/about-gvcf

Thanks, Paweł

Comments (5)

Hi GATK team,

I was working on generating a combined VCF using 150+ VCFs (building the sort of cohort). The purpose of it is to calculate variants cohort frequency. But I found the genotype is messed up in the combined VCF. Here is my cmd line:

java -jar /GATK/GenomeAnalysisTK-2.7-4/GenomeAnalysisTK.jar -R refernce.fasta \

-T CombineVariants \

--variant sample1.vcf \

--variant sample2.vcf \

-o combined.vcf

Here is one example of one variant/ position in the combined VCF. The record is very long in the combined VCF, I just grabbed the related columns here.

1 22082967 rs35545280 CAAA CAA,C,CA,CAAAA 76.73 PASS AC=109,6,104,6;AF=0.368,0.020,0.351,0.020;AN=296;DB;DP=9869;GC=48.13;MQ0=0;PercentNBaseSolid=0.0000;RU=A;STR;set=filterInvariant-filterInvariant2-filterInvariant3… GT:DP:GQ

In this record, sample 1 has this variant and it shows as "0/3:46:99",

but in the sample1.vcf, it is listed as

1 22082967 rs35545280 CA C 88.73 Low_Confidence AC=2;AF=1.00;AN=2;BaseCounts=0,76,0,0;BaseQRankSum=1.905;DB;DP=76;FS=0.000;GC=48.13;HaplotypeScore=195.0572;IndelType=DEL.NOVEL_2.;LowMQ=0.0000,0.0000,76;MLEAC=1;MLEAF=0.500;MQ=68.40;MQ0=0;MQRankSum=0.423;PercentNBaseSolid=0.0000;QD=1.17;RPA=20,19;RU=A;ReadPosRankSum=-0.741;STR;set=FilteredInAll GT:AD:DP:GQ:PL 1/1:0,17:76:19:587,19,0

And you can see that the genotype in combined VCF for sample 1 is 0/3, but in its original its is 1/1 which is homozygous. So when I calculate the cohort frequency, I'm confused on matching genotype of this variant for sample 1.

To give you more idea, I listed another sample of same variant in combined.VCF and its record in sample2.VCF.

In the combined.VCF, sample 2 shows as "0/3:91:99".

In the sample2.VCF, the record is:

1 22082967 rs35545280 CAA C,CA 549.19 PASS AC=1,1;AF=0.500,0.500;AN=2;BaseCounts=0,105,0,0;BaseQRankSum=-1.897;DB;DP=105;FS=0.000;GC=48.13;HaplotypeScore=238.8742;IndelType=MULTIALLELIC_INDEL;LowMQ=0.0000,0.0000,105;MLEAC=1,1;MLEAF=0.500,0.500;MQ=68.78;MQ0=0;MQRankSum=-0.992;PercentNBaseSolid=0.0000;QD=5.23;RPA=20,18,19;RU=A;ReadPosRankSum=-1.344;STR;set=variant2 GT:AD:DP:GQ:PL 1/2:0,11,25:105:99:1363,307,775,501,0,585

Where you can see the genotype is 1/2, but in the combined VCF, it shows as "0/3".

Please advise me if I should use any parameter in the cmd line to solve this problem.

Thank you. Linda

Comments (3)

Hi all, I have a minor concern.

After using CombineVariants tools to combine variants from GATK and Samtools, MQ annotation is missing for some of the variants specifically for variants which are present in both GATK and Samtools. Below is an example:

$grep "16203159" VCF/GatkSNP.vcf
chr8    16203159    .   G   T   4609.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-7.105;ClippingRankSum=0.652;DP=333;     
FS=1.794;MLEAC=1;MLEAF=0.500;MQ=57.94;MQ0=0;MQRankSum=1.205;QD=13.84;ReadPosRankSum=0.157   
GT:AD:DP:GQ:PL  0/1:163,170:333:99:4638,0,4651

$ grep "16203159" VCF/SamSNP.vcf
chr8    16203159    .   G   T   225 .   DP=346;VDB=4.969420e-01;RPB=-1.470555e+00;AF1=0.5;AC1=1;DP4=82,84,81,92;  
MQ=58;FQ=225;PV4=0.66,6.2e-09,1,1   GT:PL:GQ    0/1:255,0,255:99

MQ annotation is present in both the VCF files. But after combining MQ is missing as shown below:

$ grep "16203159" TMP_files/SNP_INDEL.vcf
chr8    16203159    .   G   T   4609.77 .   AC=1;AC1=1;AF=0.500;AF1=0.5;AN=2;BaseQRankSum=-7.105;  
ClippingRankSum=0.652;DP=679;DP4=82,84,81,92;FQ=225;FS=1.794;MLEAC=1;MLEAF=0.500;MQ0=0;MQRankSum=1.205;  
PV4=0.66,6.2e-09,1,1;QD=13.84;RPB=-1.470555e+00;ReadPosRankSum=0.157;VDB=4.969420e-01;set=GatkSNP-SamSNP    
GT:AD:DP:GQ:PL    0/1:163,170:333:99:4638,0,4651

Variants which are present in only of the VCF files do have MQ annotation after using CombineVariants. This happens only for variants which are present in both VCF files. Could you help to get MQ annotations in this case?

Here is the command used:

java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fa --variant:GatkSNP GatkSNP.vcf --variant:GatkINDEl GatkINDEL.vcf  
--variant:SamSNP SamSNP.vcf --variant:SamINDEL SamINDEL.vcf -o SNP_INDEL.vcf -genotypeMergeOptions PRIORITIZE -priority    
GatkSNP,GatkINDEL,SamSNP,SamINDEL --filteredrecordsmergetype KEEP_UNCONDITIONAL
Comments (1)

Hello everyone, I have used the ion proton system to sequence 15 sample exomes so far. In this system, only 3 samples can be run at a given time, so I essentially have five sets of samples. Is it still possible to use CombineVariants on the VCF files of samples that were from different sequencing runs? Thank you, Ricky

Comments (19)

Hi,

I have used CombineVariants to combine variants from GATK and samtools as shown below:

java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fa --variant:GatkSNP GATKsnp.vcf --variant:GatkINDEL GATKind.vcf --variant:SamSNP Samsnp.vcf --variant:SamINDEL Samind.vcf -o allvar.vcf -genotypeMergeOptions PRIORITIZE -priority GatkSNP,GatkINDEL,SamSNP,SamINDEL --filteredrecordsmergetype KEEP_UNCONDITIONAL

This merges all the variants. However, with the above command, i do get the variants present in both GATK and samtools emitted from samtools.

I would like to get all the variants such that:

  • variants present in both GATK and samtools emitted from GATK vcf files
  • variants in only GATK
  • variants in only samtools

could someone suggest any ideas or of there is something to be fixed in the command.

Thanks

Comments (8)

I am running CombineVariants followed by VariantEval on two VCF files (GATK version 3.1-1). I noticed that none of my "set=FilteredInAll" variants are included in the output of VariantEval. I suspect this is because the FILTER column is not "PASS" because the variants were not "PASS" in either of the input VCF files.

Is there an argument to pass into VariantEval so that it does not filter variants out based on the FILTER column? I was using the arguments from a guide article (http://gatkforums.broadinstitute.org/discussion/48/using-varianteval). I saw that in the comments you mentioned an --ignoreFilters argument, but when I tried to use it I got the GATK error: Argument with name 'ignoreFilters' isn't defined.

I can just change the FILTER column entries in the combined VCF file, but it seems like a problem that others might have as well.

Thank you!

Comments (1)

Hi I have two vcf files, each with the same chr:pos, but alleles are different:

$ tail -1 test1.snp.vcf chr17 41267926 . G A 50.77 . xxx GT:AD:DP:GQ:P 0/1:2,3:6:49:79,0,49 $ tail -1 test2.snp.vcf chr17 41267926 . G C 15.88 LowQual xxx GT:AD:DP:GQ:P 0/1:3,2:6:44:44,0,85

After running CombineVariants below

$ java -jar GenomeAnalysisTK.jar -T CombineVariants -R ucsc.hg19.fasta -V test1.snp.vcf -V test2.snp.vcf -o ThreeAlleles.snp.vcf -genotypeMergeOptions UNIQUIFY

The ThreeAlleles.snp.vcf only have GT:DP:GQ instead of GT:AD:DP:GQ:PL in the FORMAT field

$ tail -1 ThreeAlleles.snp.vcf chr17 41267926 . G A,C 50.77 PASS xxx GT:DP:GQ 0/1:6:49 0/2:6:44

How can I retain the GT:AD:DP:GQ:PL format for tri-allelic positions?

Thanks,

Comments (3)

I was wondering if there is a method for filtering individual genotype calls when using CombineVariants to merge single-called VCF files. The desired behavior that I would like would be a hybrid between the KEEP_IF_ANY_UNFILTERED and KEEP_IF_ALL_UNFILTERED arguments to the -filteredRecordsMergeType. By this, I mean that any site that is unfiltered in any input will remain unfiltered in the output, but for any genotype call from a filtered input should have a filter annotation in the "FT" field of the genotype. I will show a simplified example below (extraneous columns removed from the sample files):

Input 1:

#CHROM POS       ID         (...) FILTER  FORMAT  SAMPLE1
1      11916764  rs79387574 (...) PASS    GT:DP   0/0:45

Input 2:

#CHROM POS       ID         (...) FILTER  FORMAT  SAMPLE2
1      11916764  rs79387574 (...) LowQ    GT:DP   0/1:3

Desired Output:

#CHROM POS       ID         (...) FILTER  FORMAT    SAMPLE1      SAMPLE2
1      11916764  rs79387574 (...) PASS    GT:DP:FT  0/0:45:PASS  0/1:3:LowQ

The reason for requesting this is there is occasionally a single sample that may have had a bad call at a site. Using the "KEEP_IF_ALL_UNFILTERED" filters N-1 high quality calls. However, on the other extreme, if we use "KEEP_IF_ANY_UNFILTERED" and only a single sample passes the filters, we introduce N-1 low quality calls and assert that they pass our requisite filters. The requested hybrid method will keep all information from the input samples and allow for better granularity.

John Wallace

Comments (2)

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable.

It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.

Comments (13)

Hi,

When using HaplotypeCaller's '--emitRefConfidence', with both GVCF and BP_RESOLUTION, we get lines like this one: 1 2337032 rs1129171 C T,<NON_REF> 480.77 . BaseQRankSum=0.218;ClippingRankSum=0.103;DB;DP=45;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.344;ReadPosRankSum=1.046 GT:AD:DP:GQ:PL:SB 0/1:19,26,0:45:99:509,0,330,565,407,97 2:9,10,13,13

The new 3.0 tool GenotypeGVCFs has no problem processing these lines. However CombineVariants fails to run with this message:

##### ERROR MESSAGE: Reference records should not have more than one alternate allele

By looking at the relevant code:

public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java if( vc.hasAllele(NON_REF_SYMBOLIC_ALLELE) ) { if( vc.getAlternateAlleles().size() > 1 ) { throw new IllegalStateException("Reference records should not have more than one alternate allele"); }

Sounds like the presence of <NON_REF> in lines with ALTs is triggering this exception.

I also looked at this forum entry and the example there doesn't have <NON_REF> in lines with ALTs.

"Using the reference confidence calculation mode & generating a GVCF"

Should I be able to run CombineVariants on files produced using '--emitRefConfidence'?

Thanks, Carlos PS: I tried to include links to github and GATK forum, but the spam filter seems not to like that.

Comments (1)

Is there a way to pass a file containing a list of vcfs to CombineVariants --variant? I tried naming it with .list, but I get a tribble error.

Comments (4)

I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is:

1) Generate a VCF containing unique IDs (using GATK CallVariants)
2) Replace the VCF header with potentially non-unique IDs (using tabix -r)
3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants)

It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3.

The relevant lines are as follows:
combined:

HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802
0/0:300           0/0:127 0/0:292           0/0:290           0/0:127 0/0:127
0/0:299           0/0:127 0/0:299           0/0:293           0/0:127 0/0:127

combined.renamed:

HG00421 HG00422 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127

combined.renamed.merged:

HG00421 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127

Using the depth argument here, we can see that in the merged dataset, NA12801 has depths 290,293 whereas in the original and renamed datasets the depths were 127,127. The 290,293 depths correspond to HG00423, which is the column before.

I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch.

Thanks,

John Wallace

Comments (9)

Hi team,

I've been having some issues with DP following CombineVariants:

I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup.

When I merge the files using CombineVariants, I notice that the DP per each variant is actually equal to the sum of DP of each of the vcf files. For example: If for a shared variant in both vcf files the DP=8, then the DP in the union file will be DP=16. Neverhteless, if a given variant is not shared by both files, then the DP in the union file will be equal to the input file.

Is there a way resolve this issue?

Thanks!

Best,

Sagi

Comments (4)

Hi sir, I used GATK combine variants to call SNP from my 7 vcf files. Now I am using unified genotyper to call all samples simultaneously for SNP. When I focused on my gene of interest (DLX6), in combine variants output, there are 5 "./." among 7 samples. but in case of multisample SNP calling, DLX6 gave 0/0 in those samples and its corresponding information. So, I want to know what is the meaning of "./." in combine variants results and which results should I take for analysis?

Comments (11)

hello,

I am using CombineVariants to combine two multisample vcfs from made from HaplotypeCaller. The vcfs cover the same genomic region, and do not overlap in sample names. I have noticed some potential issues reporting AD and PL fields after running CombineVariants.

If the site was multiallelic in one vcf but biallelic in the other vcf, then the AD and PL fields are omitted altogether, such that GT:DP:GQ is what is reported in the combined file - I am guessing this is expected behavior?

however, if the site was biallelic in both vcfs, or monomorphic on one vcf but polymorphic in the other, the AD and PL fields are reported, but the numbers do not make sense.

my command line:

GATK -T CombineVariants -R ~/Capsella_rubella_v1.0_combined.fasta -V sc1_HC_94samp.vcf -V sc1_HC_set2.vcf -nt 20 -o sc1_HC.vcf &

original vcf record:

scaffold_1 3275 . G T 1170.18 . AC=2;AF=0.011;AN=188;BaseQRankSum=0.295;ClippingRankSum=-1.233;DP=5815;FS=1.186;InbreedingCoeff=-0.0106;MLEAC=1;MLEAF=5.319e-03;MQ=98.00;MQ0=0;MQRankSum=3.414;QD=9.36;ReadPosRankSum=1.761
GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,293,4371 0/0:56,0:56:99:0,218,2973 0/0:40,0:40:99:0,161,2615 0/0:65,0:65:99:0,239,3898 0/0:62,0:62:99:0,227,3612

combined vcf record:

scaffold_1 3275 . G T 1170.18 . AC=3;AF=7.979e-03;AN=376;DP=11787;MLEAC=1;MLEAF=5.319e-03;MQ0=0;set=Intersection
GT:AD:DP:GQ:PL 0/0:0,0:59:99:0,92,0 0/0:0,42:56:99:3,5,2805 0/0:675,0:40:99:29,26,8482 0/0:45,0:65:99:43,6,11447 0/0:0,187:62:99:4079,0,0 0/0:55,0:27:99:0,182,941 0/0:0,251:69:99:0,162,2300

is this a bug? or am I missing something about how to use CombineVariants? thanks much for your help!

YW

Comments (7)

I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.

Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.

#CHROM  POS       ID           REF  ALT  100403001-1           100403001-1A           100403001-1B        
                                                                                                          
1       5933530   rs905469     A    G    0/0:37,0:0,99,1192    0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
1       5933530   rs905469     A    G    0/0:37,0:189,15,1192  0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
                                                                                                          
1       10412636  rs4846215    A    T    0/0:119,0:0,358,4297  0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
1       10412636  rs4846215    A    T    0/0:119,0:110,9,0     0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
                                                                                                          
1       11729035  rs79974326   G    C    0/0:50,0:0,141,1709   0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
1       11729035  rs79974326   G    C    0/0:50,0:1930,0,3851  0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
                                                                                                          
1       16735764  rs182873855  G    A    0/0:54,0:0,138,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
1       16735764  rs182873855  G    A    0/0:54,0:174,0,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
                                                                                                          
1       17316577  rs77880760   G    T    0/0:42,0:0,123,1470   0/0:38,0:0,111,1317    0/0:53,0:0,153,1817 
1       17316577  rs77880760   G    T    0/0:42,0:233,17,1470  0/0:38,0:0,111,1317    0/0:225,25:0,153,181
                                                                                                          
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:0,87,1066  
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:1844,159,0 
                                                                                                          
1       31740706  rs3753373    A    G    0/0:123,0:0,349,4173  0/0:110,0:0,319,3793   0/0:111,0:0,328,3885
1       31740706  rs3753373    A    G    0/0:123,0:117,6,0     0/0:110,0:0,319,3793   0/0:111,0:0,328,3885

Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.

In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.

Thanks, Kevin

Comments (5)

Hi! This is my code:

java -Xmx4g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar \
    -R genome.fa \
    --filter_reads_with_N_cigar \
    -T CombineVariants \
    -V:NORMALSpost RNA-edit_SNPs_NORMALS-post_SNPeff_VA.vfc \
    -V:OMNI 1000G_omni2.5.hg19.vcf \
    -V:db137 dbsnp_137.hg19.vcf \
    -V:Hapmap hapmap_3.3.hg19.vcf \
    -V:ESP1,VCF ESP6500SI-V2-SSA137.updatedRsIds.Allchr.snps_indels_FIX.vcf \
    --out NORMALS-post_TruePositives_raw.vcf &&

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -R genome.fa \
    --filter_reads_with_N_cigar \
    -T SelectVariants \
    -V NORMALS-post_TruePositives_raw.vcf \
    -select "set == 'NORMALSpost'" \
    --out NORMALS-post_TruePositives.vcf

I hope I got this right: I now have SNPs (I have only called SNPs, not indels) that is found in my sample, 1000G, dbSNP and ESP. Right?

Question: Can I easily make a VCF with all SNPs in MY sample only, not found in anything else by tweaking this code? Or must I do it some way else?

Thanks!

Comments (8)

Hi! Sorry for my ignorance, but the ESP files contains 24 vcd in total. I tried to include them all on CombineVariants, but gets an error. I have tried to merge the 24 file to one, both with vcftools and picard, failing with both..

Somebody knows?

Thank you!

Comments (3)

If I want to merge different VCF files, which I used -L argument for calling variants against to different chromosomes individually with the same list of samples by HaplotypeCaller. I mean the sample are the same, I just used -L to call variants chromosome by chromosome separately. I suppose whether catVariants or CombineVariant will give me the same results, right ?

Comments (2)

Hi! The title says it all. Im trying to narrow down my vcf. Which tools is good to answer this question?

I have tried this:

java -Xmx4g -jar GenomeAnalysisTK.jar \ -R genome.fa \ --filter_reads_with_N_cigar \ -T CombineVariants \ -V:COHORT1 path \ -V:COHORT2 path \ --filteredAreUncalled \ -o path

But the output seams incomprehensibly massive!

Thanks!

Comments (10)

is there any way to combine a snp vcf and indel vcf (generated with the UnifiedGenotyper) later? in the way that there is only one row per locus?

regardless how I combine (I tried mainly CombineVariants), if there is something different called in the two vcf files in one locus, there are two rows in the combined one; I would like this called/written as alternatives for one locus

Comments (5)

Hi,

I'm working with different tools to call variants (GATK is one of them) for the same sample and I'm merging these results with

  
gatk -T CombineVariants -R ucsc.hg19.fasta -V:GATK GATK.vcf -V:OTHER OTHER.vcf -o combined.vcf -genotypeMergeOptions PRIORITIZE --rod_priority_list GATK,OTHER

But when I have a discrepancy between results (I mean, same chr and pos, but different call), I just get GATK's result in my result file, Is there any way to have both calls when I have discrepancies? I don't want to use -genotypeMergeOptions UNIQUIFY.

EXAMPLE:

  
GATK.vcf:
chr17   7917905 .       C       T       360.16  .       [...]   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...]  GT     0/1

  
OTHER.vcf:
chr17   7917905 .       C       T       360.16  .       [...]   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...]  GT     1/1

  
combined.vcf:
chr17   7917905 .       C       T       360.16  .       [...];set=Intersection   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=Intersection  GT     0/1

What I want to obtain:

  
combined.vcf:
chr17   7917905 .       C       T       360.16  .       [...];set=Intersection   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=GATK  GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=OTHER  GT     1/1

Comments (4)

Hi, I'm calling Variants with HaplotypeCaller in a population of 2 Parents and 7 F1-individuals. After read backed phasing I'm combining the vcf files of my genotypes with CombineVariants. In the outfile I very often find "./.". I thought this means there is no coverage at a certain position. But at many positions I do have good coverage. Why do I then get ./.? Moreover I used FastaAlternateReferenceMaker and created a new reference sequence including the variants from the parents. In that case, after I run HC and do the phasing and combine variants steps, I only get "./." at positions where there is really no coverage (as I can see in my mappings). Nadia

Comments (7)

HI

my two VCF files have different alternate allele at the same position as i have called the variants using two different callers. When i run combine Variants on the both my GT field is not updated properly. AD field is also not updated properly but i ran Variant Annotator and that fixes that issue.

VCF 1:

chr1    87708015    rs58006838  C   T   145.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.479;DB;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=3.9299;MLEAC=1;MLEAF=0.500;MQ=68.54;MQ0=0;MQRankSum=-1.109;QD=10.41;ReadPosRankSum=0.925  GT:AD:DP:GQ:PL  0/1:3,9:14:37:174,0,37

VCF 2:

chr1    87708015    .   C   A   .   PASS    AN=2;DP=4;NS=1  GT:AD:DP:GQ 0/1:2,2:4:8.42

Merged VCF file:

chr1    87708015    rs58006838  C   T,A 145.77  PASS    AC=1,0;AF=0.500,0.00;AN=2;BaseQRankSum=-1.479;DB;DP=18;Dels=0.00;FS=0.000;HaplotypeScore=3.9299;MLEAC=1;MLEAF=0.500;MQ=68.54;MQ0=0;MQRankSum=-1.109;NS=1;QD=10.41;ReadPosRankSum=0.925;set=Intersection GT:AD:DP:GQ **0/1**:3,9,2:14:37

command used:

java -jar $gatk/2.7-1-g42d771f/GenomeAnalysisTK.jar -T CombineVariants -V one.vcf.gz -V two.vcf.gz -o test.vcf -R $ref

Is this a known limitation or a bug?

Comments (5)

Hello!

So today I have to compare two different vcf (from different pipelines, only one of them is from GATK) and when I tried to combine it with combineVariants, I had an error because the references doesn't match:

The provided variant file(s) have inconsistent references for the same position(s) at chr1:1724719, A* vs. G*

I know why it's happening, but I don't know if there is an option/tools to "fix" the vcf from the other pipeline.

That I want to do is take the VCF created with a "bad" reference and use a command or GATK option to "fix" it according with the correct reference fasta file.

Thanks for your help!

Comments (1)

I've seen related issues discussed here but not exactly this one. I'm following closely the current recommendations for an exome pipeline, and the GATK version,downloaded from git, was v2.5-2-gf57256b, Compiled 2013/06/06 17:28:57.

For example, I have two samples with heterozygous variants 12:81503433C>G. The AD values for the the samples in the raw vcf file, and the SNVs-only file were 15,14 and 20,15 for the two samples and these agree with what I see in IGV. There was nothing in the indels-only file at that position. The AD values were the same in the recalibrated SNVs-only file. But after combining the recalibrated SNVs and indels with CombineVariants the AD values inexplicably became 21,0 and 0,24 respectively. This seems to be happening to many variants.

Comments (3)

-T CombineVariants -genotypeMergeOptions UNIQUIFY Is UNIQUIFY the default? I want every variant in the output file with each individual in the same column

Comments (5)

When using CombineVariants, my variants get a FILTER value of either PASS or LowQual. Would it be possible to add an option to CombineVariants which prevents the FILTER value to be set to PASS? Otherwise I have to do some file processing before I run ApplyRecalibration further downstream. It would be great if this was a feature of all walkers and not just VariantFiltration. I'm not sure if the forum is the right place for feature requests. Happy to use Bugzilla or similar instead. Thanks.

Comments (4)

I have a set of VCFs with identical positions in them:

VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP

VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP

VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP

VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP

If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?

I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.

Comments (4)

Hello,

I called variants in chunks (100000 bp chunks) on chromosome 20 on 450 reduced BAMS using -T UnifiedGenotyper, which resulted in the generation of 631 VCFs.

Now I want to combine these 631 VCFs using -T CombineVariants, however there seems to be a problem with some of the VCFs and all I get is an error message.

I can combine the 5 first VCFs that were generated, but when I add a sixth one (chr20:400000_500000) I get the following error:

##### ERROR MESSAGE: For input string: "20"

If I exclude that file, I can combine about 30 VCFs until I encounter a similar error message for a different VCF file:

##### ERROR MESSAGE: For input string: "120"

I've looked through both of the "problematic" VCFs and can't see how they differ from the ones that I can combine.

Any idea of what may be going wrong and how I can solve this? At the moment I can only identify problematic VCF by trying to combine them using CombineVariants since I don't know what it is about these particular VCF files that is causing the problem.

Thanks for your help, Tota

Comments (4)

Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:

ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.NumberFormatException: For input string: "."

I run this command:

java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY

Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?

Thanks in advance.

Comments (1)

I am trying to merge two vcfs (SNVs and INDELs) from the same sample. The problem appears to be that the INDEL vcf defines "combined_sample_name" but the SNV vcf does not. So when I merge I get two sample columns. How can I force GATK to treat them as a single sample?

I tried --assumeIdenticalSamples to do a "simple merge," but that made no difference.

As a side note, these vcfs are from an Ion Torrent machine.

Thanks!

java -jar $GATK \
    -R $TSP_FILEPATH_GENOME_FASTA \
    -T CombineVariants \
    --assumeIdenticalSamples \
    -V:${baseFolderName} ${i}/SNP_variants.vcf \
    -V:${baseFolderName} ${i}/indel_variants.vcf \
    -o ${RESULTS_DIR}/${baseFolderName}_variants.vcf
Comments (6)

Hi,

I'm just reverse engineering a colleagues script and I've noticed they're using CombineVariants in PRIORITIZE mode but without a -priority argument. I've looked at the documentation and I can't see what the defined behaviour would be in this situation. Would default priority in this situation follow the order of the arguments supplied; the reverse order; or random?

Thanks, Martin

Edit: Nevermind, from what I can see from the source it should be erroring out if -priority is not supplied. I must have missed something in the pipeline script.

Edit 2: No wait

    if ( genotypeMergeOption == VariantContextUtils.GenotypeMergeType.PRIORITIZE && PRIORITY_STRING == null )
        throw new UserException.MissingArgument("rod_priority_list", "Priority string must be provided if you want to prioritize genotypes");

is pointless because this is run first in initialize:

    if ( PRIORITY_STRING == null ) {
        PRIORITY_STRING = Utils.join(",", vcfRods.keySet());
        logger.info("Priority string not provided, using arbitrary genotyping order: " + PRIORITY_STRING);
    }

This should follow the input order yes? Unless vcfRods.keySet is sorted?

Comments (8)

Hi,

I used Beagle to phase my data but for some indels, I have some probleme :

example :

Input vcf :

2       68599872        .       ATG     A       14.40   PASS    AC=1;AC1=1;AF=0.028

Input for beagle created by ProduceBeagleInput:

2:68599872 TG - 1.0000 0.0000 0.0000 ......

Output vcf created by BeagleOutputToVCF:

2       68599872        .       ATG     .       14.40   BGL_RM_WAS_-    AC1=1;AF1=0.02965.....

error message by CombineVariants:

MESSAGE: Badly formed variant context at location 68599872 in contig 2. Reference length must be at most one base shorter than location size

Can you help me?

Tipahine

Comments (4)

Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing woked. Actually I found several discussions about the very same error message I receive, but it seems that no clear answere was provided. Here is this message:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.1-12-ga99c19d):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
ERROR ------------------------------------------------------------------------------------------

I have tried three different MS Dos commands to do the task (see belbow), but the message didn't change:

1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf

2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions UNIQUIFY

3.java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta  -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions PRIORITIZE  -priority foo,bar

I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.

Thank you very much for your help. S.

Comments (5)

When I run CombineVariants on two vcf files with variants at the same position but with different REF alleles and also different sets of ALT alleles, the AD fields for the genotypes are not updated to reflect the changes in the ALT field. The REF and ALT fields and the GT field for each genotype are all correctly updated. For example combining

3   10128965    rs71052293  CTT CT,C,CTTT   19936.43    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  0/2:115,0,33,12:230:6.96:980,1237,2795,0,946,1900,7,679,467,817 3/1:97,13,20,16:229:99:804,221,832,581,176,3047,521,0,1653,1595

and

3   10128965    rs71052293  CT  C,CTT,CTTT  14280.61    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  2/1:110,20,33,18:237:1.90:850,289,1027,457,0,1487,147,877,2,1858    0/3:80,48,5,29:209:99:1835,875,977,2101,1119,3322,0,142,331,462

gives

3   10128965    rs71052293  CTT CT,C,CTTT,CTTTT 19936.43    PASS    AC=2,1,2,1;AF=0.250,0.125,0.250,0.125;AN=8;set=Intersection GT:AD:DP:GQ 0/2:115,0,33,12:230:7   3/1:97,13,20,16:229:99  3/1:110,20,33,18:237:2  0/4:80,48,5,29:209:99

There five alleles (one REF and four ALT) but only four AD fields for each genotype.

My command line:

java -jar -Xmx4g GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V test_input1.vcf -V test_input2.vcf -o test_combined.vcf

Is this a known limitation or a bug?