GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
I ran HaplotypeCaller, VariantAnnotator, and ValidateVariants on chr3 locations from a human tumor sample.
The HaplotypeCaller command line is:
gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta"
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
-I chrom_bams/286T.chr3.bam \
-o hapc_vcfs/286T.chr3.raw.vcf
The VariantAnnotator command line is:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
--dbsnp $dbsnp --alwaysAppendDbsnpId \
-A BaseQualityRankSumTest -A DepthOfCoverage \
-A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
-A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
-A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
-A TandemRepeatAnnotator \
--variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
--out varanno_vcfs/286T.chr3.va.vcf
This all works nicely, but I go back and use ValidateVariants just to be sure:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
--dbsnp ${dbsnp} \
--variant:vcf varanno_vcfs/286T.chr3.va.vcf \
1> report/ValidateVariants/286T.chr3.va.valid.out \
2> report/ValidateVariants/286T.chr3.va.valid.err &
An issue arises with an rsID that is flagged as not being present in dbSNP:
...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP
I realize this is an error message that would not generally qualify as an issue to post to these forums. However, it is an error that seems to be generated by HaplotypeCaller, exposed by VariantAnnotator, and caught by ValidateVariants.
The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AAAGAGAAGAGAAGAG A 1865.98 .
There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AA A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG . PASS
My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather, the VariantContext function validateRSIDs() calls validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."
The UCSC genome browser for hg19 does show the specified position to be AA. It seems as though HaplotypeCaller simply used a different reference than dbSNP in this case.
The reference file supplied to HaplotypeCaller was the same as the one supplied to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to HaplotypeCaller, as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not used in the calculations. It seems as though this is a difference between the reference assembly for dbSNP and the reference supplied by the resource bundle.
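To compare the REF and ALT columns of the two records side by side, something like the following may be easier to read than a plain grep. This is only a minimal sketch: the function name vcf_fields_for_rsid is made up for illustration, and it assumes uncompressed, tab-delimited VCF files. It matches the rsID as a whole entry in the ID column, so longer IDs that merely start with the same digits are not picked up.

```shell
# Hypothetical helper (not part of GATK): print CHROM, POS, ID, REF and ALT
# for every record whose ID column contains the given rsID as a whole
# ';'-separated entry. Header lines (starting with '#') are skipped.
# Usage: vcf_fields_for_rsid RSID FILE.vcf
vcf_fields_for_rsid() {
    awk -v id="$1" '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }
        {
            n = split($3, ids, ";")
            for (i = 1; i <= n; i++) {
                if (ids[i] == id) {
                    print $1, $2, $3, $4, $5
                    break
                }
            }
        }' "$2"
}

# Example calls with the files from this post:
#   vcf_fields_for_rsid rs67850374 varanno_vcfs/286T.chr3.va.vcf
#   vcf_fields_for_rsid rs67850374 "$dbsnp"
```

Running it once on the annotated VCF and once on the dbSNP VCF prints only the five columns that matter for this comparison.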
My questions are:
As it stands, I am simply going to discard the offending lines manually. There are fewer than twenty in the entire exome sequencing of this particular tumor-normal pair. However, this issue will likely arise again. I will check the dbSNP VCF for places where its reference allele differs from the sequence in hg19. At least that should give me an estimate of how often this will arise and which locations to exclude from the variant calls.
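A related check that needs no extra tools is to list the called records whose REF differs from the dbSNP record at the same position. The sketch below is an assumption-laden starting point, not a GATK feature: the function name ref_mismatches is hypothetical, both files are assumed to be uncompressed VCFs, and only one call per position is remembered. It reads the (small) call file into memory first and then streams the (large) dbSNP file. A differing REF string can also be a legitimate alternative representation of an indel, so the output is a list of candidates to inspect rather than a list of errors.

```shell
# Hypothetical helper: report positions where the call VCF and the dbSNP VCF
# disagree on REF. Output columns: CHROM, POS, dbSNP ID, call REF, dbSNP REF.
# Usage: ref_mismatches CALLS.vcf DBSNP.vcf
ref_mismatches() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }
        # First file: remember REF for each called CHROM/POS.
        FILENAME == ARGV[1] { ref[$1, $2] = $4; next }
        # Second file: print dbSNP records at a called position with another REF.
        (($1, $2) in ref) && ref[$1, $2] != $4 {
            print $1, $2, $3, ref[$1, $2], $4
        }' "$1" "$2"
}

# Example call with the files from this post:
#   ref_mismatches varanno_vcfs/286T.chr3.va.vcf "$dbsnp" > ref_mismatches.tsv
```

The first two columns of the result can serve as the list of locations to exclude.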
-- Colin
doesn't check the header if it conforms
By mistake I just added some comments to the header like
and ValidateVariants (1.6.2) confirmed that it is a valid VCF, but vcf-tools complained about it.
BTW: how do I write comments? Do I have to use
??
I am facing this error when I try to validate the variants (ValidateVariants) of a VCF file produced by GATK just after UnifiedGenotyper. I am using GenomeAnalysisTK-2.3-6-gebbba25 and dbsnp_137.hg19.vcf. These variants are annotated by DepthOfCoverage, HaplotypeScore, InbreedingCoeff and LowMQ ...
Basically, I generate two VCF files using UnifiedGenotyper separately, one for SNP and the other for INDEL.
The error for both is about the Allele Count (AC) tag: ##### ERROR MESSAGE: File F93.snp.vcf fails strict validation: the Allele Count (AC) tag is incorrect for the record at position chr1:1225579, 1 vs. 1
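One way to look at such a record independently of GATK is to recount the ALT alleles from the genotype columns and compare the result with the AC value in INFO. The following is a rough sketch only. It does not reproduce the ValidateVariants logic, the function name check_ac is made up for illustration, and it assumes an uncompressed VCF with sample columns in which GT is the first FORMAT field.

```shell
# Hypothetical helper: recompute AC from the GT fields of all samples and
# print records whose declared INFO AC differs.
# Output columns: CHROM, POS, declared AC, recomputed AC.
# Usage: check_ac FILE.vcf
check_ac() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ || NF < 10 { next }          # skip headers and sites-only records
        {
            declared = ""
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^AC=/) declared = substr(info[i], 4)
            split($9, fmt, ":")
            if (declared == "" || fmt[1] != "GT") next

            nalt = split($5, alts, ",")
            for (a = 1; a <= nalt; a++) count[a] = 0

            # Count each ALT allele index across all sample genotypes.
            for (s = 10; s <= NF; s++) {
                split($s, fields, ":")
                m = split(fields[1], alleles, "[/|]")
                for (k = 1; k <= m; k++)
                    if (alleles[k] ~ /^[0-9]+$/ && alleles[k] + 0 > 0)
                        count[alleles[k] + 0]++
            }

            observed = count[1] ""
            for (a = 2; a <= nalt; a++) observed = observed "," count[a]
            if (observed != declared) print $1, $2, declared, observed
        }' "$1"
}

# Example call with the file named in the error message:
#   check_ac F93.snp.vcf
```

If the record at chr1:1225579 is not reported by this check, the declared and recounted values agree as plain integers, which would narrow the problem down to how the values are compared rather than to the counts themselves.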
I appreciate your comments,