Calling non-diploid organisms with UnifiedGenotyper
Posted in Methods and Workflows | Last updated on 2013-01-14 21:17:06


There are 5 comments on this article. To see them or add your own, view this post on the forum


Calling non-diploid organisms with UnifiedGenotyper

New in GATK 2.0 is the capability of UnifiedGenotyper to natively call non-diploid organisms. Three use cases are currently supported:

  1. Native variant calling in haploid or polyploid organisms.
  2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
  3. Pooled validation/genotyping at known sites.

In order to enable this feature, users need to set the -ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool). Note that all other UnifiedGenotyper arguments work in the same way.

A full minimal command line would look for example like

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I myReads.bam \
-T UnifiedGenotyper \
-ploidy 3 

The glm argument works in the same way as in the diploid case - set to [INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.

Current Limitations

Many of these limitations will be gradually removed in the following weeks as we iron out details and fix issues in the GATK 2.0 beta.

  • Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.

  • Some annotations do not work in non-diploid cases. In particular, current InbreedingCoeff is omitted. Annotations which do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF and Genotype annotations such as PL, AD, GT, etc.

  • The interaction between non-diploid calling and other experimental tools like HaplotypeCaller or ReduceReads is currently not supported.

  • Whereas it's entirely possible to use VQSR to filter non-diploid calls, we currently have no experience with this and can hence offer no support nor best practices for this.

  • Only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.

Users should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.


Return to top

There are 5 comments on this article. To see them or add your own, view this post on the forum