Tagged with #dbsnp
1 documentation article | 0 announcements | 30 forum discussions

Created 2012-07-31 17:50:15 | Updated 2016-01-27 05:56:51 | Tags: known knownsites intermediate dbsnp resource

Comments (6)

1. Notes on known sites

Why are they important?

Each tool uses known sites differently, but all of them rely on known sites to help distinguish true variants from false positives. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of base recalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs, feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence. Good luck!

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
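The strict-filtering option can be sketched in a few lines of Python. This is a minimal illustration, not an official GATK utility; the function name and the QUAL cutoff of 100 are assumptions you would tune for your own data. It keeps only biallelic SNP records above the cutoff, which can then serve as a provisional known-sites VCF:

```python
VALID_BASES = {"A", "C", "G", "T"}

def select_high_confidence_snps(vcf_lines, min_qual=100.0):
    """Keep VCF header lines plus biallelic SNP records with QUAL >= min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):              # pass headers through unchanged
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        if ref in VALID_BASES and alt in VALID_BASES:   # biallelic SNP only
            if qual != "." and float(qual) >= min_qual:  # confident call only
                kept.append(line)
    return kept
```

The filtered file would then be fed to BaseRecalibrator via -knownSites, and the call-filter-recalibrate loop repeated until the recalibration tables stabilize.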

2. Recommended sets of known sites per tool

Summary table

Tool                                 dbSNP 129   dbSNP >132   Mills indels   1KG indels   HapMap   Omni
RealignerTargetCreator                                        X              X
IndelRealigner                                                X              X
BaseRecalibrator                                 X            X              X
(UnifiedGenotyper/HaplotypeCaller)               X
VariantRecalibrator                              X            X                            X        X
VariantEval                          X

RealignerTargetCreator and IndelRealigner

These tools require known indels, passed with the -known argument, to function properly. We use both of the following files:

  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)


BaseRecalibrator

This tool requires known SNPs and indels, passed with the -knownSites argument, to function properly. We use all of the following files:

  • The most recent dbSNP release (build ID > 132)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

  • The most recent dbSNP release (build ID > 132)


VariantRecalibrator

For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.


VariantEval

This tool requires known SNPs, passed with the -dbsnp argument, to function properly. We use the following file:

  • A version of dbSNP subsetted to only sites discovered in or before dbSNP build 129, which excludes the impact of the 1000 Genomes Project and is useful for evaluating the dbSNP rate and Ti/Tv values at novel sites.
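As a sketch of how such a subset could be produced: NCBI dbSNP VCF releases carry a dbSNPBuildID tag in the INFO column recording the build in which each rsID first appeared, so filtering on that tag yields a build-129-only file. This is an illustrative snippet with an assumed function name, not the procedure actually used to build the bundle file:

```python
def subset_by_build(vcf_lines, max_build=129):
    """Keep records whose dbSNPBuildID INFO tag is <= max_build."""
    out = []
    for line in vcf_lines:
        if line.startswith("#"):              # keep headers
            out.append(line)
            continue
        info = line.rstrip("\n").split("\t")[7]
        # parse "key=value" INFO entries into a dict, skipping flag-only tags
        tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        build = tags.get("dbSNPBuildID")
        if build is not None and int(build) <= max_build:
            out.append(line)
    return out
```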

Created 2016-04-22 22:22:46 | Updated | Tags: vcf dbsnp vcf-format

Comments (1)

Hi. I want to take a dbSNP file and highlight SNPs that are found in some BAM files. Instead of filtering these out, I want to search for these SNPs only. I will then use these SNP positions as markers to trace recombination events that have occurred between some strains I am studying. So far I haven't found a good strategy to accomplish this. I was hoping someone here knows how to parse dbSNP and could guide me through it. Alternatively, you could use this idea to generate a new tool! Please let me know if you have thoughts or questions. Thanks! -Keller
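One possible way to sketch the parsing step asked about here (an assumption on my part, not an endorsed workflow): extract each dbSNP position into a GATK-style intervals list, then restrict calling to those positions with -L.

```python
def vcf_to_intervals(vcf_lines):
    """Turn dbSNP VCF records into GATK-style 'chrom:pos' interval strings."""
    intervals = []
    for line in vcf_lines:
        if line.startswith("#"):              # skip headers
            continue
        chrom, pos = line.split("\t", 2)[:2]  # only CHROM and POS are needed
        intervals.append(f"{chrom}:{pos}")
    return intervals
```

The resulting strings would be written one per line to a sites.intervals file and passed to the caller as -L sites.intervals; with UnifiedGenotyper, adding --output_mode EMIT_ALL_SITES also reports positions where all samples are homozygous reference.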

Created 2016-01-28 08:48:49 | Updated 2016-01-28 09:12:56 | Tags: fastareference dbsnp gatk mutect2

Comments (4)

I have a cancer rna-seq bam file that I would like to use to call variants using MuTect2, but I ran into a couple of errors that I would appreciate some assistance with. Thank you.

  1. The chr notation used in my bam file is absent from the reference fasta (Homo_sapiens_assembly19.fasta). I'd rather not change my bam file header using sed and samtools reheader; I would rather append the chr notation to the reference file. Will this mess up the corresponding snp file (00-All.vcf)?

  2. Is there a karyotypically sorted Ensembl reference (and its corresponding dbSNP vcf) that I can use for the GATK workflow?

  3. The biggest issue is "Error: Discordant contig lengths: read MT LN=16571, ref MT LN=16569" when I use picard's reordersam function. This is because the read was aligned against Ensembl reference. Is there a quick fix to resolve this?

EDIT: I should mention that realigning my seq against GATK's reference is not an option because I don't have the original fastq files available.

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Comments (6)

Hi guys, I have recently joint-called 27 whole-genome samples using the GenotypeGVCFs approach. While I was trying to extract some chromosomes from the final file, I got this error: The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I looked into the file and found some of the multi-allelic sites having *, as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong, but what do you think I can do to solve this problem? LAWAL

Created 2015-12-21 15:10:30 | Updated | Tags: realignertargetcreator dbsnp realignment gatk random-chromosomes

Comments (7)

I'm re-aligning some bam files using RealignerTargetCreator. I downloaded the hg38 dbSNP file and the hg38 reference genome. However, the dbSNP file contains only the canonical chromosomes [1,2...X,Y,MT], while the reference (and my bam file) also contains the random chromosomes.

here's where I get the error: MESSAGE: Input files and reference have incompatible contigs: No overlapping contigs found. 142_hg38.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT] reference contigs = [chr1, chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt ... [etc etc etc.....] ... chrX_KI270881v1_alt, chrX_KI270913v1_alt, chrY, chrY_KI270740v1_random, chrM]

So: I would like to keep the random chromosomes in the reference/bam file. How can I run this without it throwing an error? Or is this not possible, and I need to keep ONLY the chromosomes present in the dbSNP file?

thanks Seb
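One workaround sketch, assuming the reference contigs differ from the dbSNP file's only by the "chr" prefix (plus MT vs chrM): rename the contigs in the dbSNP VCF to match the reference. The helper below is a hypothetical illustration, not a GATK tool; unplaced/random reference contigs simply never match any dbSNP record and are left alone, which is fine for known-sites use.

```python
RENAME = {"MT": "chrM"}  # assumption: hg38 names the mitochondrion chrM, not chrMT

def add_chr_prefix(vcf_lines):
    """Rewrite CHROM values (and ##contig header IDs) from '1' style to 'chr1' style."""
    out = []
    for line in vcf_lines:
        if line.startswith("##contig"):
            # e.g. ##contig=<ID=1,length=...> -> ##contig=<ID=chr1,length=...>
            fixed = line.replace("<ID=", "<ID=chr", 1)
            out.append(fixed.replace("<ID=chrMT", "<ID=chrM"))
        elif line.startswith("#"):
            out.append(line)                  # other headers unchanged
        else:
            chrom, rest = line.split("\t", 1)
            chrom = RENAME.get(chrom, "chr" + chrom)
            out.append(chrom + "\t" + rest)
    return out
```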

Created 2015-12-04 14:12:52 | Updated | Tags: bqsr dbsnp

Comments (1)

We are planning to work on GRCh38, and BQSR requires a dbSNP VCF as input. The latest one for GRCh38 is b146, here: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b146_GRCh38p2/VCF/

There are two files in dbSNP: ALL_20151104.vcf (3.2G) and common_all_20151104.vcf (0.9G). Which one should we use? Should we just take the file as is, or do we need some filtering (say on MAF, multiallelic SNPs, etc.)?

Created 2015-11-06 11:07:22 | Updated | Tags: selectvariants variantannotator variantfiltration dbsnp

Comments (5)

Hi team,

Prior to submission to NCBI dbSNP, a vcf generated by e.g. HaplotypeCaller requires several modifications:

  1. Addition of in-house identifiers. -- done
  2. Exclude if alternate allele is "*", i.e. they are in a deletion. -- I'm assuming this can be done with SelectVariants or VariantFiltration.
  3. Exclude if ref or alt allele is greater than 50bp. -- Perhaps with SelectVariants or VariantFiltration --maxIndelSize 50.
  4. Exclude if ref and alt alleles do not have a common leading base. -- Not sure; removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to the Info field. -- e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. SnpEff doesn't seem to work for this, but I could be wrong.
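Outside of GATK, exclusions 2-4 plus a simplified version of step 5 can be sketched in plain Python. The function name is hypothetical and the VRT codes (1 for SNV, 2 for indel) are taken from the post; a real submission may need the full VRT vocabulary:

```python
def prepare_for_dbsnp(records):
    """Apply the post's exclusions 2-4 to tab-separated VCF records and add VRT (5)."""
    out = []
    for rec in records:
        f = rec.rstrip("\n").split("\t")
        ref, alts = f[3], f[4].split(",")
        if "*" in alts:
            continue  # (2) spanning-deletion allele
        if len(ref) > 50 or any(len(a) > 50 for a in alts):
            continue  # (3) ref or alt allele longer than 50 bp
        if any(len(a) != len(ref) and a[0] != ref[0] for a in alts):
            continue  # (4) indel lacking a common leading base
        # (5) simplified VRT: 1 = SNV, 2 = indel (per the post's examples)
        vrt = "1" if len(ref) == 1 and all(len(a) == 1 for a in alts) else "2"
        f[7] = "VRT=" + vrt if f[7] in (".", "") else f[7] + ";VRT=" + vrt
        out.append("\t".join(f))
    return out
```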

Knowing how to effectively format vcfs between GATK output and NCBI input might be quite useful for many people, and save rather a lot of time.

It would be really useful if the three exclusion criteria could be applied using GATK. Is this possible, and with which commands?

I feel as though I need to use the GATK VariantAnnotator command as well. I'm looking into all of this today, and will post if I get any solutions.


William Gilks

Created 2015-10-09 15:53:28 | Updated | Tags: combinevariants dbsnp drosophila

Comments (5)

Hi Team,

I have two vcfs from D. melanogaster. The first contains in-house variant identifiers; the second contains NCBI dbSNP variant identifiers. Only the first file contains the genotype data. There are 4 million variants in the first file and 5 million in the second, and there is expected to be substantial overlap between the two in terms of which variants are present. How can I replace the in-house variant identifier with the dbSNP identifier for each corresponding variant?
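One way to sketch this outside GATK, assuming both files use the usual tab-separated VCF layout: build a dictionary of dbSNP IDs keyed on CHROM, POS, REF and ALT, then rewrite the ID column of the in-house file on exact matches. Matching on all four fields (not just position) avoids picking up a different variant's rsID at the same coordinate. The function name is hypothetical; 5 million keys fit comfortably in memory on a typical machine.

```python
def transfer_ids(inhouse_lines, dbsnp_lines):
    """Replace in-house IDs with dbSNP rsIDs matched on CHROM, POS, REF, ALT."""
    rsid = {}
    for line in dbsnp_lines:
        if line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
        rsid[(chrom, pos, ref, alt)] = vid
    out = []
    for line in inhouse_lines:
        if line.startswith("#"):
            out.append(line)                  # keep headers
            continue
        f = line.rstrip("\n").split("\t")
        key = (f[0], f[1], f[3], f[4])
        if key in rsid:
            f[2] = rsid[key]                  # swap in the dbSNP identifier
        out.append("\t".join(f))
    return out
```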

Created 2015-09-17 07:01:32 | Updated | Tags: dbsnp rat

Comments (2)


I would like to run MuTect for rat. Is it essential to provide a dbSNP file? The rat dbSNP file does not provide information on which variants are somatic and which are germline; all have SAO=0.

Does MuTect use germline SNVs in the dbSNP file to filter germline mutations out of the output? If so, I think the rat dbSNP will not provide helpful information, as many somatic mutations might be excluded because of the ambiguous SAO flag.

One more thing: the dbSNP file is for rn5, not the latest rat assembly, rn6.

Thank you.

Created 2015-09-16 13:37:51 | Updated | Tags: haplotypecaller dbsnp

Comments (4)


I am having the following problem: I use HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I use the --dbsnp argument and provide a dbSNP file (vcf file). I thought that everything would work fine, but by coincidence I observed a (in my eyes really serious) problem: the same call is recognized in one sample, but not in another. These are the two relevant lines of the vcf files that get reported:

17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=4.819;ClippingRankSum=-1.054;DP=231;FS=78.565;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.994;QD=21.93;ReadPosRankSum=-5.473;SOR=1.639;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:23,207:230:99:5104,251,0

17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=3.120;ClippingRankSum=0.256;DB;DP=134;FS=1.120;MLEAC=2;MLEAF=1.00;MQ=59.91;MQ0=0;MQRankSum=1.849;QD=21.41;ReadPosRankSum=-1.285;SOR=0.704;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:13,121:134:96:2906,96,0

As we exclude known variants from our analysis, it is essential that this step works correctly. Yet I am pretty unsure what to do now. The variant seems to be well known (according to information on the NCBI homepage). So why was it not identified in the other sample?

It would be great if anyone could help me. Many thanks in advance!


Created 2015-05-27 10:24:19 | Updated | Tags: dbsnp

Comments (3)


The question is in the title. Is there another recommended build of dbSNP to use with the latest version of GATK on hg19?

Thanks :)

Created 2015-02-12 08:13:28 | Updated | Tags: dbsnp rnaseq singlecell

Comments (4)

In a project I need to see the allelic frequency at dbSNP sites in the RNAseq data of a single cell. To be precise, given the dbSNP positions, I need the 0/0 genotypes as well, provided there is the required read coverage. An SNV calling pipeline normally does not report a site if it is 0/0. I have come all the way to a recalibrated BAM following the RNAseq SNV calling best practices as suggested on the GATK site. Help with the command would be highly appreciated.
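Once a caller emits a record at every covered dbSNP position (for instance HaplotypeCaller with -ERC BP_RESOLUTION), the per-site allelic fraction can be read off the AD values in the sample column. A minimal sketch, assuming the FORMAT field contains an AD key as HaplotypeCaller emits; the function name is hypothetical:

```python
def allele_fractions(format_field, sample_field):
    """Return per-allele read fractions from a VCF sample's AD values."""
    keys = format_field.split(":")
    vals = sample_field.split(":")
    ad = [int(x) for x in dict(zip(keys, vals))["AD"].split(",")]
    total = sum(ad)
    # None signals a site with no informative reads at all
    return [d / total for d in ad] if total else None
```

For example, a confidently homozygous-reference site like "0/0:30,0:30" under FORMAT "GT:AD:DP" yields fractions [1.0, 0.0].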

Created 2015-02-06 09:10:56 | Updated | Tags: dbsnp hapmap 1000g omni mills

Comments (1)

When I tried to call SNPs/indels with GATK v3.3, I ran into a problem: how can I get the following datasets?

True sites training resource: HapMap
True sites training resource: Omni
Non-true sites training resource: 1000G
Known sites resource, not used in training: dbSNP
Known and true sites training resource: Mills

Does GATK provide the corresponding vcf files, such as "hapmap.vcf", "omni.vcf", "1000G.vcf", "dbsnp.vcf", and "mills.vcf"?

Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf

Comments (18)

From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:

CHROM POS ID REF ALT
chr1  1   .  A   <NON_REF>   # my data
chr1  1   .  A   T           # dbSNP

I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T, or T/T. The call of <NON_REF> isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases:

  1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
  2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2.
  3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./. . However, GenotypeGVCFs removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases:

  1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
  2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance!
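The three cases in the post can at least be separated mechanically from a BP_RESOLUTION record's GT and GQ fields. A sketch under assumptions: the GQ threshold of 20 is arbitrary, and this ignores VQSLOD entirely, so it addresses only the GQ half of the question:

```python
def classify_site(gt, gq, min_gq=20):
    """Map one sample's genotype call onto the three cases in the post."""
    if gt in ("./.", ".|."):
        return "no-call"                  # case 3: no evidence either way
    if gq is None or gq < min_gq:
        return "no-call"                  # treat low-GQ calls as unresolved
    if set(gt.replace("|", "/").split("/")) == {"0"}:
        return "hom-ref"                  # case 2: confident 0/0
    return "variant"                      # case 1: confident 0/1 or 1/1
```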


Created 2014-11-04 23:43:03 | Updated | Tags: haplotypecaller dbsnp

Comments (1)


I'm running HaplotypeCaller on whole-genome data. I'm interested in only several thousand specific positions across the genome, all of which are in dbSNP. Some of these are indels, which makes me more inclined to use HaplotypeCaller. However, I want to know the quality/likelihood of the bases called at each position in downstream analysis. So, my goal is to obtain a VCF file containing a call at every dbSNP position, whether or not it's variant with respect to the reference. I need each associated with a likelihood from VQSR, with VQSLOD having the same meaning whether or not the site is variant.

It seems that I could do this with UnifiedGenotyper --output_mode EMIT_ALL_CONFIDENT_SITES. How can I replicate this with HaplotypeCaller? Second, is there a better way to achieve this? Should I not be using HaplotypeCaller at all?

So far, I ran HaplotypeCaller with --emitRefConfidence BP_RESOLUTION --dbsnp dbsnp.vcf, followed by GenotypeGVCFs --includeNonVariantSites --dbsnp dbsnp.vcf. However, the rsIDs aren't used to populate the ID column in the resulting VCF (is this a bug?), so I can't then remove non-dbSNP positions. Secondly, annotations aren't added to positions matching the reference.

Thanks in advance. Douglas

Created 2014-10-08 15:30:10 | Updated | Tags: variantrecalibrator dbsnp variant-recalibration rice

Comments (7)

Hi, I am running GATK on rice libraries and I get the following error with VariantRecalibrator. The command I use is:

java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R genome.fasta \
    -input chr5.vcf \
    -mode SNP \
    -recalFile chr5.raw.snps.recal \
    -tranchesFile chr5.raw.snps.tranches \
    -rscriptFile chr5.recal.plots.R \
    -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_chromosomes.vcf \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

This is how the program terminates:

INFO 21:42:35,099 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 21:42:35,102 TrainingSet - Found dbSNP track: Known = true Training = true Truth = true Prior = Q6.0
INFO 21:43:05,104 ProgressMeter - Chr11:11654383 9.83e+06 30.0 s 3.0 s 87.7% 34.0 s 4.0 s
INFO 21:43:07,413 VariantDataManager - QD: mean = 27.02 standard deviation = 7.70
INFO 21:43:07,440 VariantDataManager - MQRankSum: mean = -0.62 standard deviation = 3.35
INFO 21:43:07,479 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 1.63
INFO 21:43:07,510 VariantDataManager - FS: mean = 2.11 standard deviation = 9.67
INFO 21:43:07,525 VariantDataManager - MQ: mean = 47.09 standard deviation = 11.34
INFO 21:43:07,660 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, MQRankSum, ReadPosRankSum]
INFO 21:43:07,680 VariantDataManager - Training with 7845 variants after standard deviation thresholding.
INFO 21:43:09,833 VariantRecalibratorEngine - Convergence after 93 iterations!
INFO 21:43:09,872 VariantRecalibratorEngine - Evaluating full set of 272372 variants...
INFO 21:43:09,884 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 21:43:10,854 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I have read about similar problems on the GATK forums, and based on those it seems to me that the training set of variants is too small for my data. Is that so? If so, can you please tell me how I can fix it? This is for rice, and I only have one set of known VCFs, from dbSNP.

Can you also please confirm whether my settings for 'known', 'training' and 'truth' are correct?

Thanks a lot in advance.

Created 2014-07-21 07:39:50 | Updated | Tags: bundle baserecalibrator dbsnp

Comments (5)

Hi, I have a question: is there any specific reason why the GATK bundle only has dbSNP 138, given that dbSNP 141 has been released for a while? Are you planning to release a new bundle with it? Do you recommend using the latest version, dbSNP 141, for base recalibration? Thanks.

Created 2014-05-14 10:04:29 | Updated | Tags: baserecalibrator dbsnp

Comments (1)

I want to perform base re-calibration using BaseRecalibrator. However, I work on a non-model organism and there is no available SNP database.

Would it be the right approach to create my own custom dbSNP file, using very stringent criteria (--min_base_quality_score 30) for SNP calling, and then recalibrate my data using this file? I also wonder whether I will benefit from recalibration when I use a custom dbSNP.vcf that is produced from the same samples I analyze.

I create the dbSNP.vcf using UnifiedGenotyper to save time, while for genotyping after base recalibration I use HaplotypeCaller. I have 24 samples in this analysis.

I would appreciate your assistance.

Created 2014-01-31 21:16:16 | Updated | Tags: knownsites dbsnp indel-realignment

Comments (1)

Dear GATK team,

Would you please clarify, based on your experience or the logic used in the realignment algorithm, which option -- dbSNP, the 1KG gold standard (Mills), or no known database -- might result in a more accurate set of indels in the indel realignment stage (speed and efficiency are not my concern).

Based on the documentation I found on your site, the "known" variants are used to identify "intervals" of interest, around which realignment is then performed. So it makes sense to me to use as many indels as possible for choosing the intervals (even unreliable ones such as many of those found in dbSNP), in addition to the more accurate calls found in the 1KG gold-standard dataset. After all, that increases the number of indel regions to be investigated and therefore potentially increases the accuracy. Depending on your algorithm's logic, it also seems that providing no known database would increase the chance of investigating more candidate mis-alignments and therefore improve the accuracy.

But if your logic uses the "known" indel sets to *not* perform realignment, ignoring candidates around known sites, then it makes sense to use the more accurate set, such as the 1KG gold standard.

Please let me know what you suggest.

Thank you Regards Amin Zia

Created 2014-01-17 18:30:18 | Updated 2014-01-17 18:57:38 | Tags: dbsnp

Comments (1)

Question on the dbSNP database. Looks like the SubSNP table is the only location for ancestral_allele and gene_name information for a given snp_id/subsnp_id. However, these fields in SubSNP are very sparsely populated. For example, if you search the dbSNP web page :


you can see the ancestral allele for rs5 is G. However, if you search snp_id=5 in the SubSNP table, both ancestral_allele and gene_name are null:

mysql> select gene_name,ancestral_allele from SubSNP where snp_id=5;
| gene_name | ancestral_allele |
|           |                  |
1 row in set (0.00 sec)

Can anyone explain this discrepancy or know where I can get ancestral_allele or gene_name if not in SubSNP directly ?

Created 2014-01-17 01:32:23 | Updated 2014-01-17 01:34:14 | Tags: dbsnp

Comments (3)

Dear GATK team,

When I annotate dbSNP IDs onto my vcf, I use the dbSNP 137 vcf file from the resource bundle. But it has entries that differ from the original dbSNP 137 (from UCSC); one example is the dbSNP IDs on the MT chromosome. So I can't get a fully annotated vcf when I use the dbSNP 137 file from the GATK resource bundle.

Created 2013-11-15 05:45:35 | Updated 2013-11-15 05:47:33 | Tags: haplotypecaller dbsnp

Comments (5)

Hi, I used the following commands to call variants from exactly the same file with HaplotypeCaller. However, I got different results: the results from case 1 are not consistent with case 2. On some chromosomes the number of variants in case 1 is higher than in case 2, on others lower; the differences are only a few variants per chromosome. Any idea? I assumed they would be the same, since I am just adding --dbsnp information.


Case 1:

${java7} -jar $GATK/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R $reference_genome \
    -I $input_file \
    -L X \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 50 \
    -o vcf_out/chrX.vcf


Case 2:

${java7} -jar $GATK/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R $reference_genome \
    -I $input_file \
    -L X \
    -nct 8 \
    --dbsnp $dbsnp \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 50 \
    -o vcf_out/chrX.vcf

Created 2013-11-13 18:01:55 | Updated | Tags: dbsnp

Comments (7)

Hi! Since dbSNP (for humans) has a lot of variants that one could question whether they are actually real, especially those from cDNA (mRNA), how can I ignore all SNPs in dbSNP 137 with the moltype "cDNA"?

I'm asking because a lot of these SNPs may actually be RNA editing artifacts, correct?

Created 2013-06-14 18:27:30 | Updated 2013-06-14 18:31:04 | Tags: dbsnp

Comments (12)

The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does this mean that the determination of whether a locus is a variant is not influenced by whether that variant is present in dbSNP? And what does "--dbsnp binds reference ordered data" mean?

Also, why isn't there a --indel option?

Created 2013-04-12 06:30:38 | Updated 2013-04-12 06:32:23 | Tags: mutect dbsnp

Comments (1)

Dear Kristian Cibulskis,

I ran MuTect with "--enable_extended_output". I found that some mutations of quite good quality were rejected only because of "DBSNP Site".

As you know, the dbSNP database includes some somatic mutations as well, so I am worried that true somatic mutations might be filtered out inappropriately. Do you think it is proper to keep the mutations whose failure_reasons is DBSNP_Site?

For example: 7 55248926 CCAxGTG T C ALK1T ALK1N 0 DBSNP UNCOVERED 0 0.077336 0 0.99254 0 17 1 0 12.199345 12.223447 3.617702 12.500284 0 0.444444 0.136996 0.02 -0.078943 9 5 4 150 130 60 60 0 0 TT 2.708276 0 0 0.000333 -2.296657 -0.575725 0 -0.001303 9 9 0 281 0 0.62965 0.983402 (4,1,4,0) 24.5 14 66 8 46,37,12,9 54,62,70,91 0.996932 0.839312 0.39571 -0.002982 -0.323501 -0.320518 0 DBSNP Site REJECT

Thanks a lot! Zhou

Created 2013-02-28 10:10:09 | Updated | Tags: unifiedgenotyper annotation dbsnp

Comments (1)

Hi the GATK team;

I use the UnifiedGenotyper the following way:

java -jar GenomeAnalysisTK-2.1-13-g1706365/GenomeAnalysisTK.jar \
        -R /human_g1k_v37.fasta \
        -T UnifiedGenotyper \
        -glm BOTH \
        -S SILENT \
         -L ../align/capture.bed  \
         -I  myl.bam  \
        --dbsnp broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz \
        -o output.vcf 

When I look at the generated VCF, the variant 18:55997929 (CTTCT/C) is said to be rs4149608:

18 55997929 rs4149608 CTTCT C (...)

but in the dbsnp_135.b37.vcf.gz, you can see that the right rs## should be rs144384654

$ gunzip -c broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz |grep -E -w '(rs4149608|rs144384654)'
18 55997929 rs4149608 CT C,CTTCT (...)
18 55997929 rs144384654 CTTCT C (...)

Does UnifiedGenotyper use the first rs## it finds at a given position? Or should I use another method/tool to get the 'right' rs##?

Thank you,


Created 2013-02-22 07:11:41 | Updated | Tags: unifiedgenotyper dbsnp

Comments (6)

I've spent some time looking at the various documentation and have a few lingering questions for understanding my data a little better. I was hoping you could help to clarify a bit and make sure I am making correct assumptions.

I have a set of 96 samples that we ran targeted sequencing on. I have followed the best practices for bam file processing, etc., and ran UnifiedGenotyper on all 96 bam files at once using EMIT_ALL_SITES. I have SNP chip data on all of my samples as well, so I have some "truth" to compare to.

It looks like around 25% of the SNPs I expected to get (they are present in dbSNP and I have chip data for them) are not present in my vcf file. Where did they go? Is it normal not to get all of the dbSNP sites back? Is this because the bases at the variant site were not of good enough quality to call a variant? Does UnifiedGenotyper not emit sites for which all samples are homozygous reference, even though it knows they are there because of the dbSNP file (like CASAVA)? I looked at the coverage at these positions and in some cases it was very high (over 100X), so I'm not sure that they could all be bad.

I am somewhat new to this sort of data analysis. What are the best tools for looking at the quality of the bases at a single site, to see if this is the culprit? Any other ideas as to what could be causing this?

What are the internal things in Unified Genotyper that make a site not be called?

Would I be able to use the GENOTYPE_GIVEN_ALLELES option to make UnifiedGenotyper call all dbSNP sites regardless of base quality or Phred score? If so, do I use the same reference dbSNP .vcf file that I use for the rest of the steps?

For the variants I do have: If I am interested in the coverage at a site, the AD is the unfiltered depth and the DP is the filtered depth (the sum of the allele depths is the total depth), correct?

Finally, I have not run VQSR on my output. We did targeted sequencing, and I'm not sure if the number of bases we covered is enough for VQSR to be helpful. What is the minimum number of bases recommended for VQSR? (I have 96 samples, and I know that is enough.) We are mostly interested in dbSNP values anyhow. In your opinion, is it reasonable to stop after UnifiedGenotyper since we are only interested in dbSNP calls, or should we do the VQSR step anyhow?

Thank you for all your help and insights. I really appreciate the feedback allowed for in this forum.


Created 2013-02-20 16:06:16 | Updated | Tags: bundle mutect dbsnp cosmic

Comments (55)

I'm having trouble finding the recommended COSMIC and dbSNP file for hg19 to use with MuTect (hg19_cosmic_v54_120711.vcf and dbsnp_132_b37.leftAligned.vcf). I can't find these in any of the bundles on the GATK public FTP site. I see a dbSNP file called dbsnp_132_b37.vcf; is this the same? I don't see any COSMIC file at all. I'm currently using bundle 2.3 for hg19 for the dbSNP files (and the standard indels from 1000G and Mills for indel realignment). Thanks!

Created 2012-12-16 22:27:37 | Updated 2012-12-16 22:28:40 | Tags: haplotypecaller validatevariants variantannotator fastareference vcf dbsnp hg19

Comments (7)

I ran HaplotypeCaller, VariantAnnotator, and ValidateVariants on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

#Fasta from the gz in the resource bundle

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
 -I chrom_bams/286T.chr3.bam \
 -o hapc_vcfs/286T.chr3.raw.vcf 

The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
     --dbsnp $dbsnp  --alwaysAppendDbsnpId \
    -A BaseQualityRankSumTest -A DepthOfCoverage \
    -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
    -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
    -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
    -A TandemRepeatAnnotator \
    --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
    --out varanno_vcfs/286T.chr3.va.vcf

This all works nicely, but I go back and use ValidateVariants just to be sure:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
   --dbsnp ${dbsnp} \
   --variant:vcf varanno_vcfs/286T.chr3.va.vcf \
    1> report/ValidateVariants/286T.chr3.va.valid.out \
    2> report/ValidateVariants/286T.chr3.va.valid.err &

An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP

I realize this is an error message that generally would not qualify as an issue to post to these forums; however, it is an error that seems to be generated by HaplotypeCaller, illuminated by VariantAnnotator, and caught by ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .

There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS

My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as though HaplotypeCaller simply used a different reference than dbSNP in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to HaplotypeCaller, as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference between the reference assembly used for dbSNP and the reference supplied by the resource bundle.
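As a minimal sketch of the kind of subset check described above (the helper name is hypothetical, not GATK's actual validateAlternateAlleles code; the alleles are copied from the two grep results in this post):

```python
# Sketch of the strict rsID validation idea: a call's REF and ALT
# alleles should appear among the dbSNP record's alleles at the same
# position. Alleles below are copied from the grep output in this post.

def alleles_match(call_ref, call_alts, dbsnp_ref, dbsnp_alts):
    """True if every allele in the call is present in the dbSNP record."""
    dbsnp_alleles = {dbsnp_ref} | set(dbsnp_alts)
    return call_ref in dbsnp_alleles and set(call_alts) <= dbsnp_alleles

# HaplotypeCaller call at chr3:123022685
call_ref, call_alts = "AAAGAGAAGAGAAGAG", ["A"]
# dbsnp_135.hg19.vcf record at the same position
db_ref, db_alts = "AA", ["A", "AAAGAGAAGAG", "AAAGAGAAGAGAAGAGAAGAG"]

print(alleles_match(call_ref, call_alts, db_ref, db_alts))
# → False: the call's REF allele is not among the dbSNP alleles
```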

My questions are:

  1. Is this really a problem that arises from slightly different reference assemblies?
  2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
  3. Is there a tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?
  4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are fewer than twenty in the entire exome of this particular tumor-normal pair. However, it seems likely that this issue will arise again. I will check the dbSNP VCF for places where the reference allele differs from the sequence in hg19. At least that should give me an estimate of how often this occurs and the locations to exclude from the variant calls.
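That scan can be sketched in a few lines. This is a toy illustration only (the contig sequence and VCF records below are made up); a real run would stream the actual dbSNP VCF and use an indexed fasta reader rather than an in-memory dict:

```python
# Sketch: find dbSNP records whose REF allele disagrees with the
# reference fasta at the same position. Toy data, not real calls.

def ref_mismatches(reference, vcf_lines):
    """Yield (chrom, pos, ref) for records whose REF != fasta sequence."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        chrom, pos, _id, ref = line.split("\t")[:4]
        start = int(pos) - 1                      # VCF POS is 1-based
        if reference[chrom][start:start + len(ref)] != ref:
            yield chrom, int(pos), ref

reference = {"chr3": "GGAAAGAGAAGAGAAGAGTT"}      # toy contig
vcf = [
    "chr3\t3\trs1\tAA\tA\t.\tPASS\t.",            # REF matches the fasta
    "chr3\t3\trs2\tAC\tA\t.\tPASS\t.",            # REF disagrees
]
print(list(ref_mismatches(reference, vcf)))       # → [('chr3', 3, 'AC')]
```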

-- Colin

Created 2012-11-04 10:16:50 | Updated 2012-11-05 02:28:44 | Tags: haplotypecaller reducereads dbsnp error reviewedstingexception

Comments (33)

Hello dear GATK Team,

when trying to run HaplotypeCaller on my exome files prepared with ReduceReads, I get the error stated below. As you can see, the newest GATK version is used. Also, UnifiedGenotyper does not produce any errors on the exact same data (90 SOLiD exomes created according to Best Practice v4).

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
    at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
    at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
    at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:429)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:249)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:204)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:179)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:136)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:29)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
##### ERROR ------------------------------------------------------------------------------------------

The Command line used (abbreviated):

java -Xmx30g -jar /home/common/GenomeAnalysisTK-2.2-3/GenomeAnalysisTK.jar \
  -R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
  -T HaplotypeCaller \
  -I ReduceReads/XXXXX.ontarget.MarkDups.nRG.reor.Real.Recal.reduced.bam [x90]\
  --dbsnp /home/common/hg19/dbsnp_135.hg19.vcf \
  -o 93Ind_ped_reduced_HC_snps.raw.vcf \
  -ped familys.ped \
  --pedigreeValidationType SILENT \
  -stand_call_conf 20.0 \
  -stand_emit_conf 10.0

Created 2012-09-27 09:20:35 | Updated 2012-10-02 16:33:21 | Tags: variantrecalibrator vqsr non-human dbsnp

Comments (9)

Hello, I have a newly sequenced genome with several samples for this species. I would like to follow the Best Practices, but I don't have a dbSNP or anything similar. Could I use the variants from the samples themselves as a dbSNP? For example, take the variants that coincide in all my samples and use that set as a dbSNP?
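The intersection idea above can be sketched as follows. The variant records are hypothetical placeholders; in practice you would parse (chrom, pos, ref, alt) keys out of each sample's VCF and write the shared set back out as a VCF for use as a bootstrap known-sites resource:

```python
# Sketch of the bootstrap idea: keep only variants called identically
# in every sample, and use that set as a stand-in "known sites" list.
# The sample call lists below are toy data, not real calls.

def shared_variants(per_sample_calls):
    """Intersect (chrom, pos, ref, alt) keys across all samples."""
    keysets = [set(calls) for calls in per_sample_calls]
    return sorted(set.intersection(*keysets))

sample_a = [("chr1", 100, "A", "G"), ("chr1", 250, "T", "C")]
sample_b = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "T")]
sample_c = [("chr1", 100, "A", "G"), ("chr1", 250, "T", "C")]

print(shared_variants([sample_a, sample_b, sample_c]))
# → [('chr1', 100, 'A', 'G')]
```

Note this gives a deliberately strict set; as the known-sites article suggests, high-confidence filtering plus a few rounds of recalibrate-and-recall until convergence is the usual refinement.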