Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, a distinction that is central to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs, feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated until convergence. Good luck!
Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
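As a minimal sketch of the strict-filtering idea, here is one way to pull only PASS-filter, biallelic SNP records out of an initial call set using standard Unix tools. The toy VCF and file names are illustrative assumptions, not an official recommendation, and real filtering criteria will need the experimentation described above:

```shell
# Toy stand-in for an initial, unrecalibrated call set
cat > initial_calls.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	100	.	A	G	250	PASS	DP=30
chr1	200	.	C	T	12	LowQual	DP=4
chr1	300	.	G	A,T	240	PASS	DP=28
EOF

# Keep header lines plus PASS, biallelic SNP records
# (single-base REF and a single single-base ALT)
awk '/^#/ {print; next}
     $7 == "PASS" && length($4) == 1 && length($5) == 1 {print}' \
    initial_calls.vcf > high_confidence_snps.vcf

grep -vc '^#' high_confidence_snps.vcf   # prints 1 (only the chr1:100 record survives)
```

The resulting high_confidence_snps.vcf is the kind of file one would then feed to the base quality score recalibrator as the known-sites set.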
| Tool | dbSNP 129 | dbSNP >132 | Mills indels | 1KG indels | HapMap | Omni |
These tools require known indels passed with the -known argument to function properly. We use both of the following files:
This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all of the following files:
These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:
For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.
This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:
Dear GATK team,
I'd like to learn which files from the hg38 bundle I should use for indel realignment and BQSR (I read the manual on this topic -- https://broadinstitute.org/gatk/guide/article?id=1247 -- but would just like to be sure):
1) Am I right that for indel realignment I should use Mills_and_1000G_gold_standard.indels.hg38.vcf and 1000G_phase1.snps.high_confidence.hg38.vcf.gz ?
2) Am I right that for BQSR I should use Mills_and_1000G_gold_standard.indels.hg38.vcf , 1000G_phase1.snps.high_confidence.hg38.vcf.gz , and dbsnp_144.hg38.vcf ?
3) Are there any other files with known sites I should use for indel realignment and BQSR?
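For reference, the invocation implied by (2) would look something like the sketch below. This is only an illustration: the jar, reference, and bam paths are placeholders, and the known-sites list is simply the one named in the question, so the script assembles and prints the command rather than running GATK:

```shell
# Known-sites files from the hg38 bundle, as listed in the question
KNOWN_SITES="Mills_and_1000G_gold_standard.indels.hg38.vcf
1000G_phase1.snps.high_confidence.hg38.vcf.gz
dbsnp_144.hg38.vcf"

# Assemble a BaseRecalibrator command with one -knownSites per file
CMD="java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I sample.bam"
for vcf in $KNOWN_SITES; do
    CMD="$CMD -knownSites $vcf"
done
CMD="$CMD -o recal_data.table"

# Print (rather than execute) the assembled command
echo "$CMD" | tee bqsr_command.txt
```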
I have what seems like a simple question, but I'm a little confused on the proper steps to take:
I want to run GenotypeGVCFs, but force a call at all known sites from a previous curated VCF records file, even if the site is no-called (./.).
I've searched through the forum boards, but have not found a clear way to do this. Any help would be appreciated.
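I'm not certain of the officially supported route here, but one common workaround is to restrict genotyping to the known positions by turning the curated VCF into an intervals list, which GATK accepts via -L (combined with whatever option your GATK version offers for emitting non-variant sites). A sketch with toy data; the file names are placeholders:

```shell
# Toy stand-in for the previously curated VCF of known sites
cat > curated.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	100	rs1	A	G	50	PASS	.
chr2	250	rs2	C	T	60	PASS	.
EOF

# One interval per record, in chrom:pos form
grep -v '^#' curated.vcf | awk '{print $1 ":" $2}' > known.intervals

cat known.intervals   # prints chr1:100 and chr2:250
```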
Hello, I have been using GATK for a while now, but until now I've been analyzing human samples; the current analysis is of Arabidopsis thaliana data. Since I do not have databases of known indels and SNPs for this organism, I am following the suggested workflow without known sites.

I have only 2 samples in the analysis: a wild-type parental strain, and a sample consisting of a pool of 50 plants that underwent EMS mutagenesis. This treatment causes a large number of mutations, where each of the 50 plants in the pool can carry different variants; the goal of the experiment is to find the one strong homozygous mutation common to the mutated plants that is not present in the parental strain.

Since this data is a bit different from any other data I have worked with, I would like to know whether the standard workflow (running indel realignment without known sites, running HC, filtering the high-confidence SNPs, and then BQSR and HC again) is also recommended in this case, and if so, should I apply any different cutoffs to obtain the high-confidence SNP set? Should I use the variants found in both samples to create the high-confidence SNP file? (Since the mutagenesis sample will contain many mutations with a wide range of frequencies that will not be present in the parental strain at all.)
Thank you very much, Olga karinsky
Hi, I read the entry about how to do BQSR without a known-SNPs file and have some uncertainty about how to apply it to RNA-seq data.
I am actually calling SNPs from RNA-seq data on a draft genome of a non-model organism, and was wondering what the best practice might be for doing so (it must sound like a nightmare to you working with human data :smile: ).
I can think of following workflow for each of the RNA-seq samples:
1. best-practice SNP calling on RNA reads with HaplotypeCaller
2. filter variants for "high quality" (-window 25 -cluster 3 --filterExpression "MQ < 30.0" --filterExpression "QD < 2.0" --filterExpression "DP < 5")
3. select PASS SNPs and biallelic SNPs (as the sample is diploid)
4. use the selected SNPs as knownSNPs to do BQSR
5. run HaplotypeCaller again on the recalibrated bam
6. go nuts with the resulting vcf file... =)
Should I include heterozygous SNPs when generating the BQSR recalibration file? Would you agree with that workflow, or alter the filters? (I know filtering on depth is not a good thing to do, but for RNA-seq I think it's good to require some minimal coverage of a site.)
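A rough, non-GATK approximation of steps 2-3 on a toy call set is shown below. The thresholds are the ones from the workflow above, but the INFO parsing is simplified and passing records are kept outright rather than FILTER-annotated, so this is a sketch of the logic, not a substitute for VariantFiltration:

```shell
# Toy RNA-seq call set
cat > rnaseq_calls.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	100	.	A	G	300	.	MQ=45.00;QD=10.00;DP=20
chr1	200	.	C	T	80	.	MQ=20.00;QD=1.50;DP=3
chr1	300	.	G	A	150	.	MQ=35.00;QD=8.00;DP=4
EOF

# Keep records with MQ >= 30, QD >= 2 and DP >= 5 (parsed from the INFO column)
awk '/^#/ {print; next}
{
    mq = qd = dp = 0
    n = split($8, a, ";")
    for (i = 1; i <= n; i++) {
        split(a[i], kv, "=")
        if (kv[1] == "MQ") mq = kv[2]
        if (kv[1] == "QD") qd = kv[2]
        if (kv[1] == "DP") dp = kv[2]
    }
    if (mq >= 30 && qd >= 2 && dp >= 5) print
}' rnaseq_calls.vcf > bqsr_training.vcf

grep -vc '^#' bqsr_training.vcf   # prints 1 (only the chr1:100 record passes)
```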
Comments and recommendations are very welcome. Thank you,
I would like to know about the -knownSites argument of the BaseRecalibrator. How do I get the latest available list for this? How should it be used when we want to identify unknown variant sites? And what steps should be used instead of the quality score recalibration walkers (CountCovariates, TableRecalibration) that were available in older versions of GATK?
Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are dbSNP VCF files built on an older genome version than the reference fasta I am using as basis. How does this affect the quality/accuracy of variants? How important is it to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a file built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR, etc.) can be performed without the known sites/resource file? If the answers to these questions are too detailed, can you please point me to any document, if available, that addresses this issue?
More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.
I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4)">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality)">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4)">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4)">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4)">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4)">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ)">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field">
##FILTER=<ID=RefN,Description="Reference base is N">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB)">
When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".
However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.
So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.
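One way to make that decision concretely is to list exactly which contigs each known-sites file covers and compare against the reference index. A sketch with toy inputs standing in for the real files (the scaffold name is illustrative):

```shell
# Toy known-sites header (the real MGP vcfs list chromosomes 1-19 and X only)
cat > mgp_snps.vcf <<'EOF'
##fileformat=VCFv4.1
##contig=<ID=1,length=195471971>
##contig=<ID=X,length=171031299>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
EOF

# Toy reference index with an extra unplaced scaffold
printf '1\t195471971\t10\t60\t61\nJH584299.1\t953012\t20\t60\t61\nX\t171031299\t30\t60\t61\n' > ref.fa.fai

# Contig names from the known-sites header vs. the reference index
grep '^##contig' mgp_snps.vcf | sed 's/.*ID=\([^,>]*\).*/\1/' | sort > vcf_contigs.txt
cut -f1 ref.fa.fai | sort > ref_contigs.txt

# Contigs present in the reference but absent from the known sites
comm -13 vcf_contigs.txt ref_contigs.txt   # prints JH584299.1
```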
Thanks very much!
I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.
I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:
INFO 15:12:09,747 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list
INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09
INFO 15:12:09,758 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,759 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF
INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file
INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy
INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
I've checked the indel interval lists for my samples and they do all appear to contain different intervals.
However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:
##### ERROR ------------------------------------------------------------------------------------------
This means that the SNP vcf (which has the same format as the indel vcf) is not being used by BQSR.
My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.
Thanks very much!
Dear GATK team,
Would you please clarify, based on your experience or the logic used in the realignment algorithm, which option -- using dbSNP, the 1KG gold standard (Mills), or no known database -- might result in a more accurate set of indels in the indel-based realignment stage (speed and efficiency are not my concern).
Based on the documentation I found on your site, the "known" variants are used to identify "intervals" of interest around which re-alignment is then performed. So, it makes sense to me to use as many indels as possible (even if many are unreliable garbage, as with much of dbSNP) in addition to the more accurate calls in the 1KG gold-standard dataset for choosing the intervals. After all, that increases the number of indel regions to be investigated and therefore potentially increases the accuracy. Depending on your algorithm's logic, it also seems that providing no known database would increase the chance of investigating more candidate mis-alignments and therefore improve the accuracy.
But if your logic uses the "known" indel sets to skip realignment and ignore candidates around known sites, it makes sense to use the more accurate set, such as the 1KG gold standard.
Please let me know what you suggest.
Thank you Regards Amin Zia
Hi all, I'm having problems determining the numbers of known and novel variant sites in my data. A quick count of lines by the ID field in the .vcf file generated by UnifiedGenotyper shows 952 novel and 1813 known (i.e. with an rs#) variants. When I use the same .vcf file with the VariantEval tool (specifying the same dbsnp file I used in UG), the output shows 866 novel and 1899 known variants. What might be the reason for this discrepancy, and which numbers are more reliable?
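The "quick count" side of that comparison amounts to something like the sketch below (toy records, not the real data). VariantEval instead matches positions against the dbSNP file itself, so stale or extra rsIDs in the ID column will make the two counts diverge:

```shell
# Toy call set with a mix of rsID-annotated and novel records
cat > calls.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	100	rs123	A	G	50	PASS	.
chr1	200	.	C	T	60	PASS	.
chr1	300	rs456	G	A	70	PASS	.
EOF

# Classify records by whether the ID column carries an rsID
grep -v '^#' calls.vcf \
  | awk '$3 == "." {novel++} $3 != "." {known++}
         END {printf "novel=%d known=%d\n", novel, known}'
# prints: novel=1 known=2
```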
I mapped my data against the human reference provided in the b37 resource bundle (GATK_b37_bundle) and am now trying to run BQSR using the recommended known variant sets from the same bundle.
Upon including the Mills_and_1000G_gold_standard.indels.b37.vcf known variant set I get the following error:
##### ERROR contig knownSites = MT / 16571
##### ERROR contig reference = MT / 16569
The header of the Mills_and_1000G_gold_standard.indels.b37.vcf seems to indicate that the correct 16569 bp MT version was used for the VCF file.
Why does the BQSR tool think that a different version of MT is used for the Mills_and_1000G_gold_standard.indels.b37.vcf ?
I have the same problem with the 1000G_phase1.indels.b37.vcf from the GATK_b37_bundle: I get the same error, and the MT contig seems to be the correct one in the vcf header. Only the dbsnp_137.b37.vcf is accepted by the BQSR tool without complaints about a different MT contig.
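A quick way to check this kind of mismatch yourself is to compare the contig length declared in the VCF header with the one in the reference .fai index. A sketch with toy files reproducing the two lengths from the error message (the file names are placeholders):

```shell
# Toy VCF header declaring the 16571 bp MT reported in the error
cat > knownsites.vcf <<'EOF'
##fileformat=VCFv4.1
##contig=<ID=MT,length=16571>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
EOF

# Toy b37 reference index with the 16569 bp rCRS MT
printf 'MT\t16569\t100\t60\t61\n' > reference.fa.fai

# Length of MT declared in the VCF header vs. in the reference index
vcf_len=$(grep '^##contig=<ID=MT,' knownsites.vcf | sed 's/.*length=\([0-9]*\).*/\1/')
ref_len=$(awk '$1 == "MT" {print $2}' reference.fa.fai)

echo "VCF MT length: $vcf_len, reference MT length: $ref_len"
[ "$vcf_len" = "$ref_len" ] || echo "MT contig mismatch"
```

With these toy inputs the script reports the mismatch, mirroring the BQSR error above.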
I am using the latest version of GATK. During quality score recalibration I ran into an error with the following command:

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R ~/SCZ_data/ref_hg19/hg19sum_upper.fa \
  --DBSNP dbsnp132.txt \
  -I ../output.marked.realigned.fixed.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate \
  -recalFile input.recal_data.csv
Later I understood that I should use BaseRecalibrator in this new version of GATK, but I am still not sure what to supply as the reference file of SNPs with the -knownSites argument, and where to obtain these vcf files.
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -I my_reads.bam \
  -R resources/Homo_sapiens_assembly18.fasta \
  -knownSites bundle/hg18/dbsnp_132.hg18.vcf \
  -knownSites another/optional/setOfSitesToMask.vcf \
  -o recal_data.grp
Can you please suggest what should be done?
Dear GATK team,
Thanks a lot for the new GATK version and GATK forum!
I am trying to use GATK on yeast strains. I do not have files of known SNP/indel sites, and I understand that the BaseRecalibrator must be given such a file. Do you suggest skipping recalibration and realignment, or is there another way to go here?