Tagged with #validatevariants
0 documentation articles | 1 event or announcement | 3 forum discussions


GATK 2.2 was released on October 31, 2012. Highlights are listed below; the detailed version history is at http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically downsample away low-grade contamination in the input BAM files with the --contamination_fraction_to_filter argument; the default value is 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a ReduceReads-processed BAM.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically downsample away low-grade contamination in the input BAM files with the --contamination_fraction_to_filter argument; the default value is 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where, in rare cases, N bases were chosen as consensus over legitimate A, C, G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • Improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

I ran HaplotypeCaller, VariantAnnotator, and ValidateVariants on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta" 
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
 -I chrom_bams/286T.chr3.bam \
 -o hapc_vcfs/286T.chr3.raw.vcf 

The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
     --dbsnp $dbsnp  --alwaysAppendDbsnpId \
    -A BaseQualityRankSumTest -A DepthOfCoverage \
    -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
    -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
    -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
    -A TandemRepeatAnnotator \
    --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
    --out varanno_vcfs/286T.chr3.va.vcf

This all works nicely, but I go back and use ValidateVariants just to be sure:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
   --dbsnp ${dbsnp} \
   --variant:vcf varanno_vcfs/286T.chr3.va.vcf \
    1> report/ValidateVariants/286T.chr3.va.valid.out \
    2> report/ValidateVariants/286T.chr3.va.valid.err &

An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP

I realize this is an error message that would not generally qualify as an issue to post to these forums. However, it is an error that seems to be generated by the HaplotypeCaller, illuminated by VariantAnnotator, and caught by ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .

There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS
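A plain grep matches the rsID anywhere in the line, including as a substring of a longer ID. A minimal sketch of a stricter lookup, which matches only whole entries of the semicolon-separated ID column and prints just CHROM, POS, ID, REF and ALT, is given here. The function name vcf_lookup_id is made up for illustration and is not a GATK tool; the sketch assumes an uncompressed, tab-delimited VCF.

```shell
# Print CHROM, POS, ID, REF, ALT for records whose ID column (field 3)
# lists the given rsID as one of its semicolon-separated entries.
# vcf_lookup_id is a hypothetical helper name, not part of GATK.
# Usage: vcf_lookup_id RSID [file.vcf]   (reads stdin when no file is given)
vcf_lookup_id() {
    awk -F'\t' -v OFS='\t' -v id="$1" '
        /^#/ { next }                        # skip header lines
        {
            n = split($3, ids, ";")          # an ID field may hold several rsIDs
            for (i = 1; i <= n; i++)
                if (ids[i] == id) { print $1, $2, $3, $4, $5; break }
        }
    ' ${2:+"$2"}
}
```

With the variables defined earlier, this could be run as `vcf_lookup_id rs67850374 "$dbsnp"` and again on varanno_vcfs/286T.chr3.va.vcf to put the two REF/ALT pairs side by side.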

My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather, the VariantContext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation...".

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as though the HaplotypeCaller simply used a different reference than dbSNP in this case.

The reference file supplied to HaplotypeCaller was the same as the one supplied to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller, as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference between the reference assembly for dbSNP and the reference supplied by the resource bundle.
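One way to probe this guess, sketched under the assumption that both records start at the same position: if the shorter REF allele is a prefix of the longer one, the two records agree on every reference base they share, and the difference would lie in how the indel is written rather than in the reference sequence itself. For the two records above, AA is a prefix of AAAGAGAAGAGAAGAG. The function name ref_alleles_compatible is made up for illustration.

```shell
# Succeed (exit status 0) when one REF allele is a prefix of the other,
# i.e. the two records agree on every reference base they both cover.
# ref_alleles_compatible is a hypothetical helper name, not part of GATK.
# Usage: ref_alleles_compatible REF_A REF_B
ref_alleles_compatible() {
    case "$1" in "$2"*) return 0 ;; esac     # REF_B is a prefix of REF_A
    case "$2" in "$1"*) return 0 ;; esac     # REF_A is a prefix of REF_B
    return 1
}
```

For example, `ref_alleles_compatible AAAGAGAAGAGAAGAG AA && echo compatible` prints "compatible", whereas a pair such as AC and AG would not.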

My questions are:

  1. Is this really a problem that arises from slightly different reference assemblies?
  2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
  3. Is there a tool that I have missed that would have prevented this error and allowed the pipeline to continue?
  4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are fewer than twenty in the entire exome sequencing of this particular tumor-normal pair. However, this issue will likely arise again, so I will check the dbSNP VCF for places where its reference allele differs from the sequence in hg19. That should at least give me an estimate of how often this will arise and which locations to exclude from the variant calls.
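A rough sketch of that check, comparing each record's REF allele against the FASTA sequence at CHROM:POS, could look like the following. The function name vcf_ref_mismatches is made up for illustration. The sketch assumes uncompressed inputs, compares case-insensitively (the FASTA may contain lowercase soft-masked bases), and loads the whole FASTA into memory, so it is better suited to a single-chromosome extract than to the full genome.

```shell
# Report VCF records whose REF allele differs from the FASTA at CHROM:POS.
# vcf_ref_mismatches is a hypothetical helper name, not part of GATK.
# Usage: vcf_ref_mismatches reference.fasta input.vcf
vcf_ref_mismatches() {
    awk -F'\t' '
        FNR == NR {                          # first file: the FASTA
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), name, /[ \t]/)
                contig = name[1]             # contig name up to the first blank
            } else {
                seq[contig] = seq[contig] toupper($0)
            }
            next
        }
        /^#/ { next }                        # second file: skip VCF header lines
        {
            expected = substr(seq[$1], $2, length($4))
            if (expected != toupper($4))
                print $1 ":" $2, "VCF=" $4, "FASTA=" expected
        }
    ' "$1" "$2"
}
```

Each output line names a position where the VCF REF allele and the FASTA disagree; an empty output means no disagreement was found in the records checked.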

-- Colin

ValidateVariants doesn't check whether the header conforms

By mistake I just added some comments to the header like

commtent text bla bla...

and ValidateVariants (1.6.2) reported that the file is a valid VCF, but vcf-tools complained about it.

BTW: how do I write comments? Do I have to use

comment= comment text

??
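To see which header lines a stricter parser might object to, a minimal sketch is given here. It assumes the strict reading in which every meta-information line has the form ##key=value and the only other header line is the #CHROM column line; whether a given tool enforces exactly this is an assumption. The function name vcf_bad_header_lines is made up for illustration.

```shell
# List header lines that are neither "##key=value" nor the #CHROM line.
# vcf_bad_header_lines is a hypothetical helper name, not part of GATK
# or vcf-tools. Usage: vcf_bad_header_lines [file.vcf]
vcf_bad_header_lines() {
    awk '
        !/^#/       { exit }                 # stop at the first data record
        /^##[^=]+=/ { next }                 # looks like ##key=value
        /^#CHROM\t/ { next }                 # column header line
                    { print NR ": " $0 }     # anything else is suspicious
    ' ${1:+"$1"}
}
```

Each reported line is prefixed with its line number in the file, which makes it easy to locate and either remove or rewrite in key=value form.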

I get an error when I try to validate the variants (ValidateVariants) of a VCF file produced by GATK just after UnifiedGenotyper. I am using GenomeAnalysisTK-2.3-6-gebbba25 and dbsnp_137.hg19.vcf. The variants are annotated with DepthOfCoverage, HaplotypeScore, InbreedingCoeff and LowMQ ...

Basically, I generate two VCF files using UnifiedGenotyper separately, one for SNP and the other for INDEL.

The error for both is about the Allele Count (AC) tag:

##### ERROR MESSAGE: File F93.snp.vcf fails strict validation: the Allele Count (AC) tag is incorrect for the record at position chr1:1225579, 1 vs. 1

I appreciate your comments,
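For reference, the AC tag is expected to hold, for each ALT allele, the number of times that allele appears in the called genotypes. A minimal sketch that recounts AC from the GT fields and prints the records where the INFO value disagrees is given here. The function name vcf_check_ac is made up for illustration; the sketch assumes an uncompressed VCF with GT as the first FORMAT subfield, and it skips missing alleles such as the "." in ./1.

```shell
# Recount AC from the GT fields and print records whose INFO AC disagrees.
# vcf_check_ac is a hypothetical helper name, not part of GATK.
# Usage: vcf_check_ac [file.vcf]   (reads stdin when no file is given)
vcf_check_ac() {
    awk -F'\t' '
        /^#/ || NF < 10 { next }             # skip headers and site-only records
        {
            nalt = split($5, alts, ",")
            for (a = 1; a <= nalt; a++) count[a] = 0
            for (s = 10; s <= NF; s++) {
                split($s, fields, ":")       # GT is assumed to come first
                n = split(fields[1], gt, "[/|]")
                for (i = 1; i <= n; i++)     # count ALT alleles, skip "." and 0
                    if (gt[i] ~ /^[0-9]+$/ && gt[i] > 0) count[gt[i] + 0]++
            }
            observed = count[1] ""
            for (a = 2; a <= nalt; a++) observed = observed "," count[a]
            stated = ""
            m = split($8, info, ";")
            for (i = 1; i <= m; i++)
                if (info[i] ~ /^AC=/) stated = substr(info[i], 4)
            if (stated != "" && stated != observed)
                print $1 ":" $2, "AC=" stated, "recounted=" observed
        }
    ' ${1:+"$1"}
}
```

Running `vcf_check_ac F93.snp.vcf` would list any record where the recount and the AC tag differ; no output means the two agree for every record with genotypes.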