Tagged with #varianteval
3 documentation articles | 1 announcement | 24 forum discussions

Created 2015-10-21 15:23:10 | Updated 2015-11-24 17:25:11 | Tags: varianteval
Comments (0)


Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a “truth set”. The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any “red flags” that might be indicative of error.

Underlying assumptions and truthiness*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results.

*With apologies to Stephen Colbert.


So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics. Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.

Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

Expected Values/Ranges for Human Germline Data

Sequencing Type # of Variants* TiTv Ratio
WGS ~4.4M 2.0-2.1
WES ~41k 3.0-3.3

*for a single sample

  • Number of Indels & SNPs The number of variants detected in your sample(s) is counted separately for indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic, including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvements due to a software update. For reference, a 2015 Nature paper analyzed the number of variants across various ethnicities in a moderately large cohort. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something has gone extremely wrong: e.g. 1,000 variants in a large-cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

  • TiTv Ratio This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5, simply because there are twice as many possible transversion mutations as there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it raises the expected ratio from 0.5 to ~2.0-2.1. Furthermore, CpG islands, usually found in promoter regions, have higher concentrations of methylcytosines. Because these regions are included in exome capture targets, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv Ratio is too low, your callset likely has more false positives.
    It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value based upon the length of flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the ratio of transition to transversion mutations leans away from the expected value. The amount of "lean" depends on how long the flanking sequence is.
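To make the definition concrete, here is a minimal sketch (not a GATK tool; the file name and records are invented toy data, and only PASS-ing biallelic SNPs are considered) of tallying the Ti/Tv ratio straight from the REF/ALT columns of a VCF:

```shell
# Toy VCF: two transitions (A>G, C>T) and one transversion (A>C).
cat > calls.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 . A G 50 PASS .
1 200 . C T 50 PASS .
1 300 . A C 50 PASS .
EOF

# Count transitions (A<->G, C<->T) vs transversions among PASS biallelic SNPs.
awk '!/^#/ && $7 == "PASS" && length($4) == 1 && length($5) == 1 {
       pair = $4 $5
       if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++; else tv++
     }
     END { printf "Ti=%d Tv=%d Ti/Tv=%.2f\n", ti, tv, (tv > 0 ? ti / tv : 0) }' calls.vcf
```

On real WGS data this quick tally should land near 2.0-2.1; a markedly lower value is the red flag discussed above.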

Filtering for Indel Ratio
common ~1
rare 0.2-0.5
  • Indel Ratio This ratio of insertions to deletions can vary depending on whether you are filtering for common or rare variants. A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.
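As a toy illustration (the file and its records are invented, and multiallelic ALT fields are not handled for simplicity), the indel ratio boils down to comparing REF and ALT allele lengths:

```shell
# Toy VCF: two insertions, two deletions, and one SNP (skipped as REF/ALT
# lengths are equal).
cat > calls_indel.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 . A AT 50 PASS .
1 200 . C CGG 50 PASS .
1 300 . TAA T 50 PASS .
1 400 . GC G 50 PASS .
1 500 . A G 50 PASS .
EOF

# An ALT longer than REF is an insertion; shorter is a deletion.
awk '!/^#/ && $7 == "PASS" && length($4) != length($5) {
       if (length($5) > length($4)) ins++; else del++
     }
     END { printf "ins=%d del=%d ins/del=%.2f\n", ins, del, (del > 0 ? ins / del : 0) }' calls_indel.vcf
```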

  • % Concordance This metric gives the percentage of variants in your samples that match (are concordant with) variants in your truth set; it essentially serves as a check of how well your pipeline identified variants contained in the truth set. Depending on what you are evaluating and comparing, percent concordance can vary quite significantly. An excellent truth set to use against your sample(s) would be a genotyping chip matched per sample, as this can be used to verify biological variants. Comparison to an unmatched truth set, however, could identify variants unique to your data set. Percent concordance considers all unmatched variants to be false positives. Depending on your trust in the truth set and whether or not you expect to see true, novel variants, these unmatched variants could warrant further investigation.
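A naive sketch of the idea (the files and records are invented toy data; real concordance tools also compare genotypes, which this does not):

```shell
# Toy truth set and callset, keyed by CHROM:POS:ALT.
cat > truth.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 rs1 A G . . .
1 200 rs2 C T . . .
EOF
cat > eval.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 . A G 50 PASS .
1 300 . C A 50 PASS .
EOF

# Concordance here = share of eval sites whose key appears in the truth set;
# unmatched eval sites would be treated as false positives.
awk '!/^#/ {
       key = $1 ":" $2 ":" $5
       if (NR == FNR) { truth[key] = 1; next }
       total++
       if (key in truth) conc++
     }
     END { printf "concordant=%d total=%d rate=%.1f%%\n", conc, total, 100 * conc / total }' truth.vcf eval.vcf
```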

Tools for performing variant evaluation


VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the example analysis (document in progress) for more details on how to use this tool and interpret its output.
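As a hedged sketch of what that looks like on the command line (file names and paths are placeholders, using classic GATK 3 syntax), disabling the default modules with -noEV and enabling only a couple with -EV:

```shell
# Command sketch only (requires the GATK jar, a reference, and real inputs):
# evaluate a callset against one truth set with just two modules enabled.
java -jar GenomeAnalysisTK.jar -T VariantEval \
    -R reference.fasta \
    --eval my_callset.vcf \
    --comp truth_set.vcf \
    -noEV \
    -EV CompOverlap \
    -EV TiTvVariantEvaluator \
    -o evaluation.table
```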


GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.

Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the example analysis (document in progress) of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, so at that point we plan to consolidate these two pairs of homologous tools to eliminate redundancy.

Which tool should I use?

We recommend Picard's version of each tool for most cases. The two GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. VariantEval and CollectVariantCallingMetrics produce similar metrics; however, the latter runs faster and scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

When should I use CollectVariantCallingMetrics?

  • If you have a very large callset
  • If you want to look at the metrics discussed here and not much else
  • If you want your analysis back quickly

When should I use VariantEval?

  • When you require a more detailed analysis of your callset
  • If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
  • If you need to compare to multiple truth sets at the same time

Created 2013-03-18 20:25:42 | Updated 2013-03-18 20:26:03 | Tags: official varianteval analyst intermediate tooltips
Comments (10)

VariantEval accepts two types of modules: stratification and evaluation modules.

  • Stratification modules will stratify (group) the variants based on certain properties.
  • Evaluation modules will compute certain metrics for the variants


CpG is a three-state stratification:

  • The locus is a CpG site ("CpG")
  • The locus is not a CpG site ("non_CpG")
  • The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
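A minimal sketch of that definition on a hard-coded example sequence (this is not how VariantEval detects CpG sites internally; the sequence is invented for illustration):

```shell
# Scan a toy reference sequence for CpG sites: positions (1-based) where the
# base is C and the next base in the 3' direction is G.
seq="TTCGACGTCG"
for ((i = 0; i < ${#seq} - 1; i++)); do
  [ "${seq:i:2}" = "CG" ] && echo "CpG at position $((i + 1))"
done
```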


EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.


Sample is an N-state stratification, where N is the number of samples in the eval files.


Filter is a three-state stratification:

  • The locus passes QC filters ("called")
  • The locus fails QC filters ("filtered")
  • The locus either passes or fails QC filters ("raw")


FunctionalClass is a four-state stratification:

  • The locus is a synonymous site ("silent")
  • The locus is a missense site ("missense")
  • The locus is a nonsense site ("nonsense")
  • The locus is of any functional class ("any")


CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.


Degeneracy is a six-state stratification:

  • The underlying base position in the codon is 1-fold degenerate ("1-fold")
  • The underlying base position in the codon is 2-fold degenerate ("2-fold")
  • The underlying base position in the codon is 3-fold degenerate ("3-fold")
  • The underlying base position in the codon is 4-fold degenerate ("4-fold")
  • The underlying base position in the codon is 6-fold degenerate ("6-fold")
  • The underlying base position in the codon is degenerate at any level ("all")

See the Wikipedia page on degeneracy (http://en.wikipedia.org/wiki/Genetic_code#Degeneracy) for more information.


JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See the article on using JEXL expressions.


Novelty is a three-state stratification:

  • The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
  • The locus does not overlap the knowns comp track ("novel")
  • The locus either overlaps or does not overlap the knowns comp track ("all")


CountVariants is an evaluation module that computes the following metrics:

Metric Definition
nProcessedLoci Number of processed loci
nCalledLoci Number of called loci
nRefLoci Number of reference loci
nVariantLoci Number of variant loci
variantRate Variants per locus rate
variantRatePerBp Number of variants per base
nSNPs Number of SNP loci
nInsertions Number of insertions
nDeletions Number of deletions
nComplex Number of complex loci
nNoCalls Number of no-call loci
nHets Number of het loci
nHomRef Number of hom-ref loci
nHomVar Number of hom-var loci
nSingletons Number of singletons
heterozygosity Heterozygosity per locus rate
heterozygosityPerBp Heterozygosity per base pair
hetHomRatio Heterozygosity to homozygosity ratio
indelRate Indel rate (insertion count + deletion count)
indelRatePerBp Indel rate per base pair
deletionInsertionRatio Deletion to insertion ratio
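As a rough illustration of how nHets, nHomRef, nHomVar and hetHomRatio relate (the single-sample VCF below is invented toy data; the GT value is taken from the first sample column):

```shell
# Toy single-sample VCF: two hets (one phased), one hom-var, one hom-ref.
cat > sample.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA001
1 100 . A G 50 PASS . GT 0/1
1 200 . C T 50 PASS . GT 1/1
1 300 . G A 50 PASS . GT 0|1
1 400 . T C 50 PASS . GT 0/0
EOF

# Tally genotype classes; phased separators are normalized to "/".
awk '!/^#/ { gt = $10; sub(/:.*/, "", gt); gsub(/\|/, "/", gt)
             if (gt == "0/1" || gt == "1/0") het++
             else if (gt == "1/1") homvar++
             else if (gt == "0/0") homref++
           }
     END { printf "nHets=%d nHomRef=%d nHomVar=%d hetHomRatio=%.2f\n", het, homref, homvar, (homvar > 0 ? het / homvar : 0) }' sample.vcf
```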


CompOverlap is an evaluation module that computes the following metrics:

Metric Definition
nEvalSNPs Number of eval SNP sites
nCompSNPs Number of comp SNP sites
novelSites Number of eval sites outside of comp sites
nVariantsAtComp Number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)
compRate Percentage of eval sites at comp sites
nConcordant Number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)
concordantRate The concordance rate

Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

 $ grep -v '##' dbsnp.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132

Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

 $ grep -v '##' eval_correct_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

 $ grep -v '##' eval_incorrect_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Running VariantEval with just the CompOverlap module:

 $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
        -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
        -L 1:10327 \
        -B:dbsnp,VCF dbsnp.vcf \
        -B:eval_correct_allele,VCF eval_correct_allele.vcf \
        -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
        -noEV \
        -EV CompOverlap \
        -o eval.table

We find that the eval.table file contains the following:

 $ grep -v '##' eval.table | column -t 
 CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
 CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.


TiTvVariantEvaluator is an evaluation module that computes the following metrics:

Metric Definition
nTi Number of transition loci
nTv Number of transversion loci
tiTvRatio The transition to transversion ratio
nTiInComp Number of comp transition sites
nTvInComp Number of comp transversion sites
TiTvRatioStandard The transition to transversion ratio for comp sites

Created 2012-07-23 23:57:11 | Updated 2012-07-23 23:57:11 | Tags: gatkdocs varianteval
Comments (0)

A new tool has been released!

Check out the documentation at VariantEval.

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 | Tags: unifiedgenotyper official baserecalibrator combinevariants haplotypecaller selectvariants varianteval release-notes
Comments (0)

Base Quality Score Recalibration

  • Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
  • Implemented support for SOLiD no call strategies other than throwing an exception.
  • Fixed smoothing in the BQSR bins.
  • Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

  • Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
  • UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
  • Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
  • In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
  • Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

  • Added LowQual filter to the output when appropriate.
  • Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
  • Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
  • Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
  • Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
  • Fixed bug where non-standard bases from the reference would cause errors.
  • Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

  • Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
  • Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
  • Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

  • Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
  • Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

  • Now outputs the first non-MISSING QUAL, instead of the maximum.
  • Now supports multi-threaded running (with the -nt argument).

Select Variants

  • Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
  • No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
  • If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).


  • Updated and improved the BadCigar read filter.
  • GATK now generates a proper error when a gzipped FASTA is passed in.
  • Various improvements throughout the BCF2-related code.
  • Removed various parallelism bottlenecks in the GATK.
  • Added support of X and = CIGAR operators to the GATK.
  • Catch NumberFormatExceptions when parsing the VCF POS field.
  • Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
  • Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
  • We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
  • Added support for handling complex events in ValidateVariants.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

Created 2015-11-25 19:52:15 | Updated | Tags: varianteval interpretation
Comments (2)

I used the VariantEval module to evaluate 33 of my samples against a ref-data across 23 chromosomes. I expected the sum nHomVar + nHets = nSNPs, but it is not! I don't know why the number of nHets is so large. Any thoughts or clues? Thanks

allSample nVariantLoci nSNPs nInsertions nDeletions nHets nHomRef nHomVar nSingletons hetHomRatio
all_chr1 51144 47364 1719 2061 152157 672 33997 22615 4.48
all_chr2 37491 34738 1246 1507 102477 257 19140 18172 5.35
all_chr3 35139 32591 1163 1385 98830 386 18342 16502 5.39
all_chr4 26399 24419 864 1116 71321 198 14864 12911 4.80
all_chr5 28058 25933 958 1167 76321 290 14099 13391 5.41
all_chr6 31222 28940 1009 1273 82483 243 16410 15084 5.03
all_chr7 28351 26331 907 1113 80464 221 14682 13448 5.48
all_chr8 22551 20968 701 882 63569 224 13134 10650 4.84
all_chr9 22478 20987 679 812 60268 198 11341 11055 5.31
all_chr10 27740 25691 924 1125 79158 182 15035 12875 5.26
all_chr11 24012 22309 762 941 66762 197 14378 11521 4.64
all_chr12 25239 23310 832 1097 71793 218 15187 11800 4.73
all_chr13 12494 11547 388 559 34506 113 7156 5945 4.82
all_chr14 15825 14640 534 651 42887 123 9137 7689 4.69
all_chr15 17554 16249 585 720 49245 133 10203 8329 4.83
all_chr16 22966 21529 656 781 63713 165 13000 10825 4.90
all_chr17 23983 22285 768 930 71715 236 15978 10776 4.49
all_chr18 9741 8935 374 432 25755 93 4743 4793 5.43
all_chr19 26262 24448 792 1022 81650 359 17700 11176 4.61
all_chr20 13068 12133 431 504 37924 155 7633 6197 4.97
all_chr21 7722 7220 212 290 20928 124 4497 3807 4.65
all_chr22 14151 13188 406 557 42289 156 8814 6180 4.80
all_chr23 5330 4869 221 240 10552 92 5177 2772 2.04

Created 2015-07-16 18:45:45 | Updated | Tags: varianteval sensitivity specificity combiegvcfs
Comments (1)

I have 33 samples (VCF files) resulted from HaplotypeCaller and like to evaluate them using dbSNP as a comp to get concordance_rate, sensitivity, and specificity. One way is to evaluate each sample separately. Another approach is to combine all 33 files using "CombineGVCFs" and then evaluate one VCF file. Which of these 2 methods do you think is feasible and acceptable for variant evaluation? why? Thanks

Created 2015-07-01 19:10:26 | Updated | Tags: varianteval specificity
Comments (13)

In the following article you explained sensitivity, discrepancy rate and concordance rate. What about the specificity rate?


Could you please point to an article or a reference that explains what each parameter in the output of VariantEval measures. Thanks

Created 2015-04-14 09:02:46 | Updated | Tags: varianteval
Comments (1)

VariantEval provides many useful statistics for evaluating callsets. However, the documentation I could find on the GATK website was somewhat out of date, making some columns hard to understand or guess at. I have two questions: 1. What are the meanings of nMNPs, nComplex, nSymbolic, nMixed and nHomDerived in a CountVariants table in a VariantEval report? 2. Where can I find a formula or explanation for all the calculations in VariantEval?

Thanks a lot! Hiu

Created 2015-04-10 16:14:16 | Updated | Tags: varianteval
Comments (4)

How can I get more information about output of Variant-Evaluation tool? Is there any document or article for more details? Particularly more detail about specificity, sensitivity, TiTvratio, hethomratio, etc. Thanks.

Created 2015-01-27 20:07:05 | Updated 2015-01-27 20:09:02 | Tags: varianteval r
Comments (2)

VariantEval follows a strict format that is human-readable, but the top few lines of each table more than hint that it's ready to be parsed by something else. I've been wondering what that is; it looks like Python... Is there a nice quick way to load all of these data into something like R or Python? I can imagine using grep to put each table into a file and loading that into R, but if the output was designed for a certain tool, it would be great to use that.

Thanks as always for the wonderful tools!

Example, what language or tool is this stuff for?


#:GATKTable:CompOverlap:The overlap between eval and comp sites

Created 2014-07-28 22:08:52 | Updated | Tags: combinevariants varianteval ignorefilter
Comments (8)

I am running CombineVariants followed by VariantEval on two VCF files (GATK version 3.1-1). I noticed that none of my "set=FilteredInAll" variants are included in the output of VariantEval. I suspect this is because the FILTER column is not "PASS" because the variants were not "PASS" in either of the input VCF files.

Is there an argument to pass into VariantEval so that it does not filter variants out based on the FILTER column? I was using the arguments from a guide article (http://gatkforums.broadinstitute.org/discussion/48/using-varianteval). I saw that in the comments you mentioned an --ignoreFilters argument, but when I tried to use it I got the GATK error: Argument with name 'ignoreFilters' isn't defined.

I can just change the FILTER column entries in the combined VCF file, but it seems like a problem that others might have as well.

Thank you!

Created 2014-07-11 11:46:43 | Updated | Tags: varianteval stratification
Comments (2)

Using the following command line: $ java -Xmx2g -jar /media/data/Applications/GenomeAnalysisTK.jar -R /media/data/GATK/hg19/ucsc.hg19.fasta -T VariantEval -noEV -ST FunctionalClass -o eval759_FunctionalClassreport --eval Sample_759.filtered.vcf -D /media/data/GATK/hg19/dbsnp_138.hg19.vcf

The tables show no data for missense, nonsense etc Eg



#:GATKTable:CompOverlap:The overlap between eval and comp sites

CompOverlap CompRod EvalRod FunctionalClass JexlExpression Novelty nEvalVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap dbsnp eval all none all 2850055 55006 2795049 98.07 2531746 90.58
CompOverlap dbsnp eval all none known 2795049 0 2795049 100 2531746 90.58
CompOverlap dbsnp eval all none novel 55006 55006 0 0 0 0
CompOverlap dbsnp eval missense none all 0 0 0 0 0 0
CompOverlap dbsnp eval missense none known 0 0 0 0 0 0
CompOverlap dbsnp eval missense none novel 0 0 0 0 0 0
CompOverlap dbsnp eval nonsense none all 0 0 0 0 0 0
CompOverlap dbsnp eval nonsense none known 0 0 0 0 0 0
CompOverlap dbsnp eval nonsense none novel 0 0 0 0 0 0
CompOverlap dbsnp eval silent none all 0 0 0 0 0 0
CompOverlap dbsnp eval silent none known 0 0 0 0 0 0
CompOverlap dbsnp eval silent none novel 0 0 0 0 0 0

This is my first foray into GATK but I thought everything in the command was OK. Is there something wrong with selection of the specific dbSNP file or reference that the SNP are not beng split into classes?

Many thanks.

Created 2014-04-17 16:46:28 | Updated | Tags: varianteval
Comments (3)


I have genotype calls on an exome sequencing dataset, and have used the GATK TiTvVariantEvaluator tool to estimate TiTv ratios for my samples. However, I'm having trouble understanding the "TiTvRatioStandard" column and how to interpret it. My TiTv ratios for known variants are 2.56, below the generally accepted cutoff of 2.8. The TiTv ratio I get for novel variants is also very low (~1.25), which should raise a red flag, but the value in the "TiTvRatioStandard" column for novel variants is 1.04, and for known variants it is 2.56.

I haven't been able to understand what these values mean, and whether I should use them as threshold/quality measures. If someone could point me to more information on this, it would be very helpful.
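For reference, my understanding is that the Ti/Tv ratio is simply the number of transitions (A<->G, C<->T) divided by the number of transversions among the SNPs being evaluated. A toy Python sketch of that computation (my own illustration, not GATK code):

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# everything else is a transversion. Illustrative only, not GATK's implementation.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps):
    """snps: list of (ref, alt) single-base tuples; returns transitions/transversions."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
# 2 transitions and 2 transversions -> ratio 1.0
```

So a low ratio on novel calls means the novel set contains relatively more transversions than expected for real variation.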


Created 2013-11-20 15:52:46 | Updated | Tags: varianteval stratification
Comments (1)

I'd like to be able to perform stratifications in a multi sample vcf, by values that are in the format fields. Almost all of the existing stratifications are based on site specific information rather than sample specific ones. One stratification in particular that I would like to perform is by ReadDepth. I would like to be able to differentiate for instance, all samples with ReadDepth greater than 20. This works in single sample vcfs, but it produces strange results in ones with multiple samples, since each VariantContext contains multiple genotypes.

Melting my VCFs and reporting multiple lines for each position seems possible, but ugly. Splitting the VCFs so that each sample is in its own VCF is also possible and also ugly. What is the recommended method for dealing with this sort of stratification?
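To make the melting idea concrete, here is a toy Python sketch of what I mean: expand one multi-sample VCF record into per-sample rows and keep only samples above a depth cutoff. It assumes a simple FORMAT containing GT and DP and is not meant as production code:

```python
def melt_record(line, samples, min_dp=20):
    """Expand one multi-sample VCF data line into per-sample rows.
    Assumes GT and DP are present in the FORMAT column; illustrative only."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], fields[1]
    fmt = fields[8].split(":")
    gt_idx, dp_idx = fmt.index("GT"), fmt.index("DP")
    rows = []
    for sample, call in zip(samples, fields[9:]):
        parts = call.split(":")
        dp = int(parts[dp_idx])
        if dp > min_dp:  # the per-sample ReadDepth stratification I'm after
            rows.append((sample, chrom, pos, parts[gt_idx], dp))
    return rows

line = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35\t0/0:12\t1/1:28"
melted = melt_record(line, ["S1", "S2", "S3"])
# keeps S1 (DP 35) and S3 (DP 28), drops S2 (DP 12)
```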

Created 2013-08-28 21:25:28 | Updated 2013-08-28 21:41:49 | Tags: validatevariants varianteval vcf strelka
Comments (10)

Strelka produces VCF files that GATK has issues with. The files pass vcftools validation, which according to the docs is the official validation, but they do not pass ValidateVariants, and VariantEval can't read them either. I'm unsure where the bug lives.

The VCF file looks like this:

##startTime=Thu Aug  1 15:23:54 2013
##content=strelka somatic indel calls
##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interupted homopolymer length intersecting with the indel">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
##FILTER=<ID=iHpol,Description="Indel overlaps an interupted homopolymer longer than 14x in the reference sequence">
##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
##cmdline=/xchip/cga_home/louisb/Strelka/strelka_workflow_1.0.7/libexec/consolidateResults.pl --config=/xchip/cga/benchmark/testing/full-run/somatic-benchmark/spiked/Strelka_NDEFGHI_T12345678_0.8/config/run.config.ini
1   797126  .   GTAAT   G   .   PASS    IC=1;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=2;RU=TAAT;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1   DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   47:47:48,49:0,0:3,3:48.72:0.00:0.00 62:62:36,39:17,19:9,9:42.49:0.21:0.00

The output I get from ValidateVariants is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T ValidateVariants --variant strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:19:45,289 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO 17:19:45,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:19:45,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:19:45,295 HelpFormatter - Program Args: -T ValidateVariants --variant strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO 17:19:45,295 HelpFormatter - Date/Time: 2013/08/28 17:19:45
INFO 17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:19:45,300 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO 17:19:45,412 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:19:45,513 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:19:45,533 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO 17:19:45,615 GenomeAnalysisEngine - Preparing for traversal
INFO 17:19:45,627 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:19:45,627 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:19:45,627 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:19:46,216 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /Users/louisb/Workspace/strelkaVcfDebug/strelka1.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:797126 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

output from VariantEval is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T VariantEval --eval strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,333 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,335 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:15:44,335 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:15:44,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:15:44,339 HelpFormatter - Program Args: -T VariantEval --eval strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,339 HelpFormatter - Date/Time: 2013/08/28 17:15:44
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,349 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:15:44,476 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:15:44,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:15:44,623 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:15:44,710 GenomeAnalysisEngine - Preparing for traversal
INFO  17:15:44,722 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:15:44,723 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:15:44,831 VariantEval - Creating 3 combinatorial stratification states
INFO  17:15:45,382 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
    at org.broadinstitute.sting.gatk.walkers.varianteval.evaluators.CountVariants.update1(CountVariants.java:201)
    at org.broadinstitute.sting.gatk.walkers.varianteval.util.EvaluationContext.apply(EvaluationContext.java:88)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:455)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:124)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
##### ERROR ------------------------------------------------------------------------------------------

Created 2013-07-26 21:27:21 | Updated | Tags: varianteval
Comments (3)

Hi GATK Team,

I am a heavy user of the VariantEval evaluators (particularly GenotypeConcordance) and tracked some unexpected results to the Sample stratification. What is the motivation for not stratifying the comp RODs by sample? This seems to be a very conscious choice, so I was hoping to understand the background of that decision.

The relevant section of VariantEval.java is:

for ( final RodBinding<VariantContext> compRod : comps ) {
                            // no sample stratification for comps
                            final HashMap<String, Collection<VariantContext>> compSetHash = compVCs.get(compRod);
                            final Collection<VariantContext> compSet = (compSetHash == null || compSetHash.size() == 0) ? Collections.<VariantContext>emptyList() : compVCs.get(compRod).values().iterator().next();

The effect for me is that many spurious genotypes get included in cases where there is a comp variant but no eval variant.


Created 2013-06-21 21:44:03 | Updated | Tags: haplotypecaller varianteval gatkreport
Comments (5)


I just finished running a fairly large number of WGS samples through HaplotypeCaller and I've been using VariantEval to look at some summary stats on these samples. I've noticed that under '#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table' there's a section on structural variations, and apparently I'm getting about 3500 in one of my samples. Here's the actual section of the table in question:

#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table
VariantSummary  CompRod  EvalRod  JexlExpression  Novelty  nSamples  nProcessedLoci  nSNPs    TiTvRatio  SNPNoveltyRate  nSNPsPerSample  TiTvRatioPerSample  SNPDPPerSample  nIndels  IndelNoveltyRate  nIndelsPerSample  IndelDPPerSample  nSVs  SVNoveltyRate  nSVsPerSample
VariantSummary  dbsnp    vcf1     none            all             1      3095693981  3446166       2.08            1.34         3446166                2.08             0.0   962028             15.33            962028               0.0  3282          73.58           3282
VariantSummary  dbsnp    vcf1     none            known           1      3095693981  3399907       2.08            0.00         3399907                2.08             0.0   814506              0.00            814506               0.0   867           0.00            867
VariantSummary  dbsnp    vcf1     none            novel           1      3095693981    46259       1.71          100.00           46259                1.71             0.0   147522            100.00            147522               0.0  2415         100.00           2415

I didn't think that HaplotypeCaller even looked for structural variants, so I tried to find these structural variants in the VCF, hoping they were encoded as described here, but I couldn't find anything. Could someone tell me why VariantEval reports a number of structural variants when I can't find any in the actual VCF? Does VariantEval just interpret a sufficiently large indel as an SV? If so, I can understand why it might report some structural variants, considering the sample contains indels longer than 1 kb.
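I don't know what rule VariantEval actually applies, but the kind of length-based reclassification I'm imagining would look something like this sketch (the 150 bp cutoff and the function are placeholders I made up for illustration, not anything from GATK's code):

```python
# Hypothetical classifier: treat indels longer than some cutoff as SVs.
# SV_LENGTH_CUTOFF is a made-up placeholder, not GATK's actual threshold.
SV_LENGTH_CUTOFF = 150

def classify(ref, alt):
    """Classify a variant from its REF/ALT allele strings (single ALT only)."""
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "SNP" if len(ref) == 1 else "MNP"
    return "SV" if size > SV_LENGTH_CUTOFF else "INDEL"

# A 200 bp insertion would then be counted as an SV even though the caller
# emitted it as an ordinary indel record.
```

If something like this is what's happening, the "nSVs" column would just be counting my longest indels under a different name.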



Created 2013-06-10 21:09:29 | Updated | Tags: varianteval genotypeconcordance stratification
Comments (3)

I would like to evaluate variant calls to produce a plot (pseudo-ROC) of sensitivity vs. specificity (or concordance, etc.) when I condition on a minimum/maximum value for a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there were a simple tool to annotate each variant as concordant or discordant, I could tabulate the results myself. I would like to rely on GATK's variant comparison logic to compare variants (especially indels). Any thoughts on whether the current tools can be parameterized or adapted for these purposes?
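To make the one-pass idea concrete, here is a toy Python sketch (my own illustration, not an existing GATK tool): bucket each call by its metric value once, then take cumulative sums from the highest value down to read off the retained concordant/total counts at every minimum-metric cutoff:

```python
from collections import defaultdict

def curve(calls):
    """calls: iterable of (metric, is_concordant) pairs.
    Returns {cutoff: (n_concordant_kept, n_total_kept)} for each distinct
    metric value, computed in a single pass over the calls."""
    buckets = defaultdict(lambda: [0, 0])  # metric -> [concordant, total]
    for metric, ok in calls:
        buckets[metric][0] += int(ok)
        buckets[metric][1] += 1
    conc = tot = 0
    out = {}
    for metric in sorted(buckets, reverse=True):
        c, t = buckets[metric]
        conc += c
        tot += t
        out[metric] = (conc, tot)  # counts kept when requiring metric >= this value
    return out

calls = [(30, True), (30, True), (20, True), (20, False), (10, False)]
# curve(calls) -> {30: (2, 2), 20: (3, 4), 10: (3, 5)}
```

Dividing the kept-concordant counts by the number of truth calls at each cutoff would give the sensitivity axis of the pseudo-ROC.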

Thanks for your help in advance,


Created 2013-04-04 08:39:18 | Updated | Tags: varianteval
Comments (7)


I have processed 10 whole-exome samples using the GATK best practices workflow (GATK v2.4-3-g2a7af43). I am currently evaluating my variant call set (generated from HaplotypeCaller) with OMNI 2.5 SNP array (comparison set) and dbSNP 137.

I have included 2 rows from the Ti/Tv Variant Evaluator table:

CompRod  EvalRod  Novelty  Sample  nTi    nTv    tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard
OMNI     MyCalls  all      all     79945  30322  2.64       993588     274219     3.62
dbsnp    MyCalls  all      all     79945  30322  2.64       30214009   15253850   1.98

According to my literature survey, the Ti/Tv ratio should be approximately 2.1 for whole-genome sequencing and 2.8 for whole-exome sequencing. Since I am getting a Ti/Tv of 2.64 for an exome, does this indicate false positives in the data? Also, what could be the rationale for getting such a high TiTvRatioStandard for the OMNI whole-genome data?
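From the numbers, TiTvRatioStandard looks like it is simply nTiInComp / nTvInComp, i.e. the Ti/Tv ratio of the comparison set itself rather than an independent quality threshold. A quick check of that reading against the two rows above (my own arithmetic, not GATK documentation):

```python
# Recompute TiTvRatioStandard as nTiInComp / nTvInComp for both comp tracks.
rows = {
    "OMNI":  (993588, 274219),      # (nTiInComp, nTvInComp) from the table
    "dbsnp": (30214009, 15253850),
}
standard = {name: round(ti / tv, 2) for name, (ti, tv) in rows.items()}
# standard -> {"OMNI": 3.62, "dbsnp": 1.98}, matching the reported column
```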


Created 2013-01-24 22:13:52 | Updated | Tags: varianteval jexl
Comments (2)


I'm currently running VariantEval to count variants per individual, stratified by a variety of annotations.

My GATK call looks like:

java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar \
    -T VariantEval \
    -R Homo_sapiens_assembly19.fasta \
    -o output.txt \
    -L input.vcf \
    -eval input.vcf \
    -ST Sample -noST \
    -noEV -EV CountVariants \
    -ST JexlExpression --select_names "nonsynon" --select_exps "resource.VAT_CDS == 'nonsynonymous' && resource.FOUNDERS_FRQ > 0.05" \
    -ST JexlExpression --select_names "synon" --select_exps "resource.VAT_CDS == 'synonymous' && resource.FOUNDERS_FRQ > 0.05" ...

where the VAT_CDS section of the INFO field in the VCF has a functional annotation or is set to "na" if an annotation is unavailable. I'm getting the following error:

ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for synon with message For input string: "nan"
ERROR ------------------------------------------------------------------------------------------

but weirdly, the error is not consistent (my data is stratified by chromosome, and most chromosomes will run without throwing the error while one or two do exhibit it). Do you have any idea what's causing this behavior?


Created 2013-01-16 15:05:06 | Updated 2013-01-17 16:23:28 | Tags: varianteval exome community
Comments (1)


I found the materials of the BroadE Workshop very helpful, especially the slides on analyzing variant calls using VariantEval, because there is not much documentation for it on the GATK site. As an example, 62 whole-genome sequencing samples from northern Europe were evaluated together with 1000G FIN samples, and the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparators. I would very much like to do the same for our whole-exome data. The question is: is there good-quality whole-exome data that I can use to evaluate our own exome data?

I have checked the NHLBI ESP project Exome Variant Server site, but the VCF files that can be downloaded don't have the genotype data.

Thanks in advance!

Created 2013-01-08 13:47:41 | Updated 2013-01-09 03:06:09 | Tags: varianteval
Comments (2)

Hello, I've just started using VariantEval. It seems very useful and provides a good deal of output. However, it's not obvious to me what all of the fields mean. Do you know when there will be documentation of the output?

For example, some things I'd like to know:


  • What is the difference between TiTvRatio and TiTvRatioPerSample?
  • What is SNPNoveltyRate? (In general I don't understand the rate fields)


  • What is TiTvRatioStandard? What are the nTiDerived and nTvDerived fields?


  • What are variantRate and variantRatePerBp? Is variantRatePerBp the average number of bp between variants? (If anything it seems like the names for these should be reversed.)
  • What are heterozygosity and heterozygosityPerBp?

What is the --dbsnp parameter used for? Is this redundant with --comp?

Many thanks for your help,


Created 2012-12-19 13:02:10 | Updated | Tags: varianteval multi-sample variant
Comments (4)


I want to know the best way to use VariantEval to get statistics for each sample in a multi-sample VCF file. If I call it like this:

java -jar GenomeAnalysisTK.jar \
    -R ucsc.hg19.fasta \
    -T VariantEval \
    -o multisample.eval.gatkreport \
    --eval annotated.combined.vcf.gz \
    --dbsnp dbsnp_137.hg19.vcf

where annotated.combined.vcf.gz is a VCF file that contains ~1 million variants for ~800 samples, I get statistics for all samples combined, e.g.

#:GATKReport.v1.1:8
#:GATKTable:11:3:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%d:%.2f:;
#:GATKTable:CompOverlap:The overlap between eval and comp sites
CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  ...
CompOverlap  dbsnp    eval     none            all      471704         191147
CompOverlap  dbsnp    eval     none            known    280557         0
CompOverlap  dbsnp    eval     none            novel    191147         191147

But I would like to get one such entry per sample. Is there an easy way to do this?
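For concreteness, the kind of per-sample tally I'm after could also be computed outside GATK with something like this toy Python sketch (assumes plain-text VCF lines and counts called, non-reference genotypes; illustrative only, and a real run would stream the gzipped file):

```python
def per_sample_counts(vcf_lines):
    """Count called, non-reference genotypes per sample from VCF text lines."""
    samples, counts = [], {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # skip meta-information headers
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names from the column header
            counts = {s: 0 for s in samples}
            continue
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0].replace("|", "/")
            alleles = set(gt.split("/"))
            if "." not in alleles and alleles != {"0"}:
                counts[sample] += 1  # called and carries an ALT allele
    return counts

vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\t./.",
]
# per_sample_counts(vcf) -> {"S1": 2, "S2": 0}
```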

Thanks, Thomas

Created 2012-12-04 20:12:44 | Updated 2012-12-04 20:15:44 | Tags: varianteval jexl genotype
Comments (1)

While running VariantEval, I'm trying to stratify by a JexlExpression using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite GQ existing in all genotypes in the VCF. Looking at the code, it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext), and thus the JEXL context does not include genotype-level fields such as GQ, which produces the error.

My question is: is this a feature or a bug? It seems possible to add the genotype (when the VC only has one), or to loop over the genotypes and either OR or AND the results (perhaps controlled by another input similar to -isr?), but perhaps I'm missing something subtle.

Would you like this behavior, or are you happy with the current operation of JexlExpression?


Created 2012-11-22 21:09:04 | Updated | Tags: documentation selectvariants varianteval
Comments (3)


I've a couple of essentially documentation-centric questions...

Firstly, the SelectVariants documentation describes selecting 1000 random variants from a VCF using '-number 1000'; however, when I try that (with the command "java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error 'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, and the --help output does not list this argument (GATK 2.2.2). Is this option no longer available?

Secondly, the 'gatkforums.broadinstitute.org/discussion/48/using-varianteval#Understanding_Genotype_Concordance_values_from_Variant_Eval' section of the 'Using VariantEval' page has a series of images explaining the concordance values; however, the images are missing. Could these please be restored?

Many thanks, James

Created 2012-09-24 21:04:12 | Updated 2012-09-25 15:53:45 | Tags: varianteval genotypeconcordance
Comments (2)


I am using VariantEval --evalModule GenotypeConcordance in order to establish concordance and sensitivity metrics against a HapMap reference. In the resulting GATK report I obtain the following fields for a given SNP category (example with HETs):

GenotypeConcordance  CompRod  EvalRod  JexlExpression  Novelty  variable                                value---
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HET                       6220
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_REF                      0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_HOM_VAR                     20
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_MIXED                        0
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_NO_CALL                    318
GenotypeConcordance  comp     eval     none            all      n_true_HET_called_UNAVAILABLE                  0

What is the meaning of the _MIXED and _UNAVAILABLE fields?

Thx, Gene

Created 2012-08-23 20:13:34 | Updated 2012-09-04 16:18:19 | Tags: varianteval
Comments (3)

I'm on unstable v2.1-61-gde29ed5, Compiled 2012/08/23 16:03:06

I run VariantEval on the file below and get this:

java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
    at org.broadinstitute.sting.utils.variantcontext.CommonInfo.getAttributeAsInt(CommonInfo.java:212)
    at org.broadinstitute.sting.utils.variantcontext.VariantContext.getAttributeAsInt(VariantContext.java:600)
    at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.AlleleCount.getRelevantStates(AlleleCount.java:50)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.getEvaluationContexts(VariantEval.java:481)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:409)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:91)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:72)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:268)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

The file is this (after header)

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A3N     A4N     A5N     A6N     A8N     A9N     H10N    H15N    H17N    H26N    H27N    H32N    H52N    H53N   H54N     H55N    H56N    H57N    H58N    H61N    H62N    H6N     H7N
    1       901922  rs62639980      G       A,C     1332.23 PASS    AC=1,2;AF=6.849e-03,0.014;AN=146;BaseQRankSum=-3.705;DB;DP=2675;DS;Dels=0.00;FS=3.860;HaplotypeScore=1.6813;InbreedingCoeff=-0.0001;MLEAC=2,2;MLEAF=0.014,0.014;MQ=58.62;MQ0=1;MQRankSum=2.225;QD=7.84;ReadPosRankSum=3.761;SB=-5.136e+02;VQSLOD=3.9184;culprit=HaplotypeScore  GT:AD:DP:GQ:PL  0/0:1,0,0:59:99:0,178,2360,178,2360,2360        0/0:60,0,0:60:99:0,156,2055,156,2055,2055       0/0:60,0,0:60:99:0,160,2191,160,2191,2191       0/0:1,0,0:64:99:0,193,2516,193,2516,2516        0/0:1,0,0:47:99:0,141,1776,141,1776,1776        0/0:1,0,0:62:99:0,187,2480,187,2480,2480        0/0:59,1,0:60:99:0,119,1965,147,1968,1996       0/0:56,0,0:56:99:0,135,1857,135,1857,1857      0/0:1,0,0:57:99:0,172,2280,172,2280,2280 0/0:1,0,0:75:99:0,226,3000,226,3000,3000        0/0:8,0,0:71:99:0,214,2852,214,2852,2852        0/0:60,0,0:60:99:0,147,1942,147,1942,1942      0/0:47,0,0:47:99:0,123,1654,123,1654,1654        0/0:1,0,0:56:99:0,169,2240,169,2240,2240        0/0:1,0,0:55:99:0,166,2122,166,2122,2122        0/2:37,0,23:60:99:647,734,1844,0,1110,1050      0/0:1,0,0:71:99:0,214,2840,214,2840,2840        0/2:44,0,16:60:99:377,483,1801,0,1319,1277      0/0:57,0,0:57:99:0,141,1914,141,1914,1914       0/0:1,0,0:53:99:0,160,2154,160,2154,2154        0/0:1,0,0:74:99:0,223,3008,223,3008,3008        0/0:1,0,0:49:99:0,147,1960,147,1960,1960        0/0:1,0,0:76:99:0,229,2988,229,2988,2988`

The file was created by first calling my ~20 samples with 50 random samples from 1kg and then subsetting to my samples using SelectVariants.

Created 2012-08-14 01:32:37 | Updated 2012-08-14 01:32:37 | Tags: varianteval
Comments (9)

Hi, I'm running 1.6-512-gafa4399 (unstable).

When running VariantEval, I got an error: "Couldn't find state for 88 at node org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode"

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Couldn't find state for 88 at node org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode@1a7df60a
at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode.find(StratNode.java:117)
at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratificationManager.getKeys(StratificationManager.java:208)
at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratificationManager.values(StratificationManager.java:260)
at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.getEvaluationContexts(VariantEvalWalker.java:480)
at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.map(VariantEvalWalker.java:406)
at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.map(VariantEvalWalker.java:89)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:257)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)

This happened on running VariantEval like this:

-T VariantEval -L /xchip/cga/reference/hg19/whole_exome_agilent_1.1_refseq_plus_3_boosters_plus_10bp_padding_minus_mito.Homo_sapiens_assembly19.targets.interval_list -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta -nt 1 -o myFile.byAC.eval -eval myFile.vcf -D dbsnp_132_b37.leftAligned.vcf -gold /humgen/gsa-hpprojects/GATK/bundle/current/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -ST EvalRod -ST CompRod -ST Novelty -ST FunctionalClass -ST AlleleCount -noST -EV TiTvVariantEvaluator -EV CountVariants -EV CompOverlap -EV IndelSummary -noEV 

Do you know what may be causing this error? If it helps, I can distill a small failing vcf. ./adam