This article is part of the Best Practices workflow documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

Many variant callers specialize in either SNPs or indels, or (like the GATK's own UnifiedGenotyper) have to call them using separate models of variation. The HaplotypeCaller can call SNPs and indels simultaneously via local de novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This makes the HaplotypeCaller more accurate in regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers such as the UnifiedGenotyper.

In addition, the HaplotypeCaller is able to estimate the probability that a given site is non-variant. This is very useful when you want to distinguish between cases where no variant was called because the evidence suggests that the site is non-variant, as opposed to cases where no call could be made either way because there was no data available. This capability, conferred by the reference confidence model, is used in the Best Practices workflow to produce a gVCF (short for genomic VCF) for each sample in a cohort.
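To make the distinction between "confidently non-variant" and "no data" concrete, here is a minimal Python sketch. It does not parse real gVCF files; it works on a simplified, hypothetical `SiteRecord` holding the alternate allele, depth (DP) and genotype quality (GQ) of one site. The `<NON_REF>` symbolic allele is what GATK gVCFs use for non-variant records, but the `min_gq` threshold of 20 and the category names are assumptions chosen purely for illustration.

```python
from typing import NamedTuple


class SiteRecord(NamedTuple):
    """Simplified, hypothetical view of one gVCF record (not a real parser)."""
    chrom: str
    pos: int
    alt: str    # "<NON_REF>" alone means no variant allele was called
    depth: int  # DP: number of reads covering the site
    gq: int     # GQ: genotype quality of the reported genotype


def classify_site(rec: SiteRecord, min_gq: int = 20) -> str:
    """Label a site as variant, confidently reference, or uninformative.

    min_gq is an illustrative threshold, not a GATK recommendation.
    """
    if rec.alt != "<NON_REF>":
        # A concrete alternate allele is listed, e.g. "A,<NON_REF>".
        return "variant"
    if rec.depth == 0:
        # No reads at all: no call could be made either way.
        return "no_data"
    if rec.gq >= min_gq:
        # Reads are present and support the reference genotype.
        return "confident_reference"
    return "low_confidence_reference"


records = [
    SiteRecord("20", 10001, "A,<NON_REF>", 35, 99),
    SiteRecord("20", 10002, "<NON_REF>", 40, 60),
    SiteRecord("20", 10003, "<NON_REF>", 0, 0),
    SiteRecord("20", 10004, "<NON_REF>", 3, 6),
]
labels = [classify_site(r) for r in records]
```

In this toy input, only the second record counts as evidence that the site is non-variant; the third and fourth simply lack enough data to say.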

Once you've pre-processed your data according to our recommendations, you are ready to undertake the variant discovery process, i.e. identify the sites where your data displays variation relative to the reference genome, and calculate genotypes for each sample at those sites. Unfortunately, some of the variation you observe is caused by mapping and sequencing artifacts, so the greatest challenge here is to balance the need for sensitivity (minimizing false negatives, i.e. failing to identify real variants) against the need for specificity (minimizing false positives, i.e. failing to reject artifacts). We have found that it is very difficult to reconcile these objectives in a single step, so instead we decompose the variant discovery process into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.
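The sensitivity and specificity trade-off described above can be made explicit with a small Python sketch. It assumes you have a set of candidate sites, a truth set of real variants among them, and the set of sites that survived calling and filtering; all three inputs, and the function name `sensitivity_specificity`, are hypothetical and exist only for illustration. Real evaluations typically rely on curated truth resources rather than toy sets like these.

```python
def sensitivity_specificity(called, truth, candidates):
    """Return (sensitivity, specificity) for a set of called sites.

    called:     sites kept as variants after calling/filtering
    truth:      sites that are real variants
    candidates: every site under consideration (real variants + artifacts)
    """
    artifacts = candidates - truth
    true_pos = len(called & truth)       # real variants that were kept
    false_neg = len(truth - called)      # real variants that were missed
    false_pos = len(called & artifacts)  # artifacts that were not rejected
    true_neg = len(artifacts - called)   # artifacts correctly rejected
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


candidates = set(range(1, 11))  # ten candidate sites
truth = {1, 2, 3, 4}            # four of them are real variants

# A permissive call set keeps every real variant but also three artifacts.
permissive = sensitivity_specificity({1, 2, 3, 4, 5, 6, 7}, truth, candidates)

# A strict call set rejects all artifacts but loses one real variant.
strict = sensitivity_specificity({1, 2, 3}, truth, candidates)
```

Here the permissive set reaches a sensitivity of 1.0 at a specificity of 0.5, while the strict set reaches a specificity of 1.0 at a sensitivity of 0.75, which is the tension the separate calling and filtering steps are meant to manage.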

Notes on which tools to use

  • The GATK includes two variant calling tools, HaplotypeCaller and UnifiedGenotyper. The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper, and we recommend using HaplotypeCaller in all cases, with only a few exceptions (see FAQs below).
  • For best results, variant filtering should be done with the Variant Quality Score Recalibration (VQSR) tools. In some cases (small datasets, non-human organisms) VQSR is not possible, and filtering must be done by applying hard filters instead (see FAQs below).
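To illustrate what applying hard filters means in practice, the following Python sketch checks a variant's annotations against fixed thresholds. It is not GATK code and does not read VCF files: the annotation names (QD, FS, MQ) are standard GATK annotations, but the threshold values shown are assumptions based on commonly cited starting points for SNPs, and the helper `failed_filters` is hypothetical. Check the current GATK documentation before relying on any specific cutoff.

```python
# Illustrative SNP thresholds; treat these values as assumptions, not as
# an authoritative recommendation. Each entry maps an annotation name to
# a predicate that returns True when the variant FAILS that filter.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,   # low quality relative to depth
    "FS": lambda v: v > 60.0,  # strong strand bias
    "MQ": lambda v: v < 40.0,  # poor mapping quality
}


def failed_filters(annotations, filters=SNP_HARD_FILTERS):
    """Return the names of the filters a variant fails.

    annotations: dict of annotation name -> numeric value for one variant.
    Annotations missing from the dict are skipped rather than failed.
    """
    return [
        name
        for name, fails in filters.items()
        if name in annotations and fails(annotations[name])
    ]


good_snp = {"QD": 18.5, "FS": 1.2, "MQ": 59.8}
weak_snp = {"QD": 1.5, "FS": 10.0, "MQ": 60.0}
biased_snp = {"QD": 10.0, "FS": 75.0, "MQ": 30.0}

results = {
    "good": failed_filters(good_snp),
    "weak": failed_filters(weak_snp),
    "biased": failed_filters(biased_snp),
}
```

Unlike VQSR, which learns a model from the data, every threshold here is fixed in advance, which is why hard filtering remains usable for small datasets or organisms without the known-variant resources that VQSR needs.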