Why didn't the UG or HC call my SNP? I can see it right there in IGV!
Posted in FAQs | Last updated on 2013-08-23 22:02:20

Comments (10)

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.

Here is a checklist of questions you should ask yourself:

  • How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.

  • What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

  • What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

  • Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

  • Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.

Return to top Comment on this article in the forum