Each tool uses known sites differently, but what they all have in common is that they use them to help distinguish true variants from false positives, which is central to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum, "What are the standard resources for non-human genomes?", in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of base quality score recalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs, feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence. Good luck!
Some experimentation will be required to figure out the best way to select the highest-confidence SNPs for use here. One option is to call variants with several different calling algorithms and take the intersection of their call sets; another is to apply a very strict round of filtering and keep only the variants that pass.
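The bootstrapping recipe above could be sketched roughly as follows. All tool choices (HaplotypeCaller for calling, a hard QD filter for picking the confident set), file names, and the number of rounds are illustrative assumptions, not fixed recommendations; the `gatk` helper just prints each command here -- replace its body with `java -jar GenomeAnalysisTK.jar "$@"` to actually run it.

```shell
# Dry-run sketch of the known-sites bootstrap loop (GATK3-style syntax).
gatk() { echo java -jar GenomeAnalysisTK.jar "$@"; }

REF=reference.fasta
BAM=original.bam

for round in 1 2 3; do                  # repeat until the calls converge
    # 1. Initial SNP calling on the not-yet-recalibrated data.
    gatk -T HaplotypeCaller -R "$REF" -I "$BAM" -o "raw_${round}.vcf"

    # 2. Keep only the highest-confidence SNPs via strict filtering.
    gatk -T SelectVariants -R "$REF" -V "raw_${round}.vcf" \
         -select 'QD > 20.0' -o "known_${round}.vcf"

    # 3. Recalibrate base qualities against that provisional known set.
    gatk -T BaseRecalibrator -R "$REF" -I "$BAM" \
         -knownSites "known_${round}.vcf" -o "recal_${round}.grp"
    gatk -T PrintReads -R "$REF" -I "$BAM" \
         -BQSR "recal_${round}.grp" -o "recal_${round}.bam"

    BAM="recal_${round}.bam"            # next round uses recalibrated reads
done
```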
| Tool | dbSNP 129 | dbSNP >132 | Mills indels | 1KG indels | HapMap | Omni |
| ---- | --------- | ---------- | ------------ | ---------- | ------ | ---- |
These tools require known indels, passed with the `-known` argument, to function properly. We use both of the following files:
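As a sketch, an invocation passing both indel files with `-known` might look like the following (assuming RealignerTargetCreator as the consuming tool; the file names are the b37 bundle names, and the command is only printed here, not executed):

```shell
# Hypothetical GATK3-style command; adjust paths and file names as needed.
CMD="java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R reference.fasta -I sample.bam \
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
    -known 1000G_phase1.indels.b37.vcf \
    -o realigner.intervals"
echo "$CMD"
```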
This tool requires known SNPs and indels, passed with the `-knownSites` argument, to function properly. We use all of the following files:
These tools do NOT require known sites, but if SNPs are provided with the `-dbsnp` argument they will use them for variant annotation. We use this file:
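For illustration, a UnifiedGenotyper call supplying dbSNP for annotation might look like this (file names are placeholder examples from the b37 bundle; the command is only printed here):

```shell
# Hypothetical GATK3-style command; dbSNP is optional and used only
# to annotate calls (e.g. with rs IDs), not to restrict calling.
CMD="java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R reference.fasta -I sample.bam \
    --dbsnp dbsnp_137.b37.vcf \
    -o calls.vcf"
echo "$CMD"
```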
For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.
This tool requires known SNPs, passed with the `-dbsnp` argument, to function properly. We use the following file:
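A hypothetical VariantEval invocation with a dbSNP file might look like the following (tool and file names are assumptions for illustration; the command is only printed here):

```shell
# Hypothetical GATK3-style command; dbSNP drives the known/novel split
# in the evaluation report.
CMD="java -jar GenomeAnalysisTK.jar -T VariantEval \
    -R reference.fasta \
    --eval my_calls.vcf \
    --dbsnp dbsnp_137.b37.vcf \
    -o report.gatkreport"
echo "$CMD"
```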
Dear GATK team,
Would you please clarify, based on your experience and the logic used in the realignment algorithm, which option (using dbSNP, the 1K gold standard (Mills...), or no known database) might result in a more accurate set of indels in the indel-based realignment stage? Speed and efficiency are not my concern.
Based on the documentation I found on your site, the known variants are used to identify intervals of interest around which realignment is then performed. So it makes sense to me to use as many indels as possible for choosing the intervals (even unreliable ones, such as many of those found in dbSNP) in addition to the more accurate calls found in the 1K gold-standard datasets. After all, that increases the number of indel regions to be investigated and therefore potentially increases the accuracy. Depending on your algorithm's logic, it also seems that providing no known database would increase the chance of investigating more candidate mis-alignments and therefore improve the accuracy.
But if your logic uses the known indel sets only to skip realignment around known sites, it makes sense to use the more accurate set, such as the 1K gold standard.
Please let me know what you suggest.
Thank you. Regards, Amin Zia
Hi all, I'm having problems determining the numbers of known and novel variant sites in my data. A quick count of lines by the "ID" field in the .vcf file generated by UnifiedGenotyper shows 952 novel and 1813 known (i.e. with an rs#) variants. When I use the same .vcf file with the VariantEval tool (specifying the same dbsnp file I used in UG), the output shows 866 novel and 1899 known variants. What might be the reason for this discrepancy, and which numbers are more reliable?
I mapped data against the human reference provided in the GATK_b37_bundle resource bundle and I am now trying to run BQSR using the recommended known variant sets from the same resource bundle.
Upon including the Mills_and_1000G_gold_standard.indels.b37.vcf known variant set I get the following error:
##### ERROR contig knownSites = MT / 16571
##### ERROR contig reference = MT / 16569
The header of the Mills_and_1000G_gold_standard.indels.b37.vcf seems to indicate that the correct 16569 bp MT version was used for the VCF file.
Why does the BQSR tool think that a different version of MT is used for the Mills_and_1000G_gold_standard.indels.b37.vcf ?
I have the same problem with the 1000G_phase1.indels.b37.vcf from the GATK_b37_bundle: I get the same error, and the MT contig seems to be the correct one in the VCF header. Only the dbsnp_137.b37.vcf is accepted by the BQSR tool without it complaining about a different MT contig.
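One way to diagnose such a mismatch is to compare the MT length recorded in the VCF header against the one in the reference index (.fai). A minimal sketch, where the `##contig` header pattern and all file names are assumptions about how the bundle files are laid out:

```shell
# Extract the MT contig length from a VCF header line of the form
#   ##contig=<ID=MT,length=16569,...>
vcf_mt_len() { sed -n 's/.*##contig=<ID=MT,length=\([0-9][0-9]*\).*/\1/p' "$1"; }

# Extract the MT length from a samtools faidx index (column 2).
fai_mt_len() { awk '$1 == "MT" {print $2}' "$1"; }

# Usage (with the b37 bundle files, names assumed):
#   vcf_mt_len Mills_and_1000G_gold_standard.indels.b37.vcf
#   fai_mt_len human_g1k_v37.fasta.fai
```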
I am using the latest version of GATK. During quality score recalibration I ran into the following error. The command was as follows: java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R ~/SCZ_data/ref_hg19/hg19sum_upper.fa --DBSNP dbsnp132.txt -I ../output.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv
Later I understood that I should use BaseRecalibrator with this new version of GATK, but I am still not sure what to supply as the reference file of SNPs for the -knownSites argument. Where can I obtain these VCF files?
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -I my_reads.bam \
    -R resources/Homo_sapiens_assembly18.fasta \
    -knownSites bundle/hg18/dbsnp_132.hg18.vcf \
    -knownSites another/optional/setOfSitesToMask.vcf \
    -o recal_data.grp
Can you please suggest what should be done?
Dear GATK team,
Thanks a lot for the new GATK version and GATK forum!
I am trying to use GATK for yeast strains. I do not have files of known SNP/indel sites. I understand that the BaseRecalibrator must be given such a file. Do you suggest skipping calibration and realignment, or is there another way to go here?