1 documentation article | 0 announcements | 5 forum discussions

Created 2014-07-23 17:38:17 | Updated 2015-07-29 22:05:51 | Tags: haplotypecaller genotype likelihoods genotyping

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

Note that this describes the **regular mode** of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in `-ERC`

modes (`GVCF`

and `BP_RESOLUTION`

) please see the reference confidence model documentation.

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but desired ploidy can be set using the `-ploidy`

argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered (currently, the maximum "workable" ploidy is ~20 for a max number of alt alleles = 6). Future developments will aim to mitigate those limitations.

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

We use the approach described in [Li2011] to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype **G** given the observed data **D**.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype **G**:

$$ P(G) $$

represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype **G** is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype **G** into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype **G**.

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A, T, and C at a site, the possible genotypes are AA, AT, AC, TT, TC, and CC, and we end up with 6 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record. For details of how these metrics are calculated, please see the variant annotations documentation.

No posts found with the requested search criteria.

Created 2015-08-13 19:29:37 | Updated | Tags: genotyping

Maybe this is an obvious question, but if standard practice is to filter individuals with certain percentage of missing genotypes then would that be somewhat a conundrum since their genotype data was used for joint-genotyping? Would that mean you would have to go back and re-jointgenotype without the individual with excess missing genotypes?

I assume this is really just an issue at the threshold level (ie a few snps to push it over the threshold).

Created 2014-12-17 18:04:27 | Updated | Tags: haplotypecaller gatk error rnaseq genotyping genotyping-mode

Hi, I'm currently trying to use GATK to call variants from Human RNA seq data

So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines. (using HaplotypeCaller in DISCOVERY mode on each sample separately)

But I'd like to go further and try to get the genotype in every sample, of each variant found in at least one sample. This, to differentiate for each variant, samples where that variant is absent (homozygous for reference allele) from samples where it is not covered (and therefore note genotyped).

To do so, I've first used CombineVariants to merge variants from all my samples and to create the list of variants to be genotype ${ALLELES}.vcf

I then try to regenotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before: my command is the following:

******java -jar ${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R ${GENOMEFILE}.fa -I ${BAMFILE_CALIB}.bam
--genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles ${ALLELES}.vcf -out_mode EMIT_ALL_SITES
-dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20

-o ${SAMPLE}_genotypes_all_variants.vcf
-mbq 25 -L ${CDNA_BED}.bed --dbsnp ${DBSNP}.vc**f

In doing so I invariably get the same error after calling 0.2% of the genome.

java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

because the problem seemed to originate from getAlternateAllele, I tried to play with --max_alternate_alleles by setting it to 2 or 10, without success. I also checked my ${ALLELES}.vcf file to look for malformed Alternate alleles in the region where the GATK crashes (Chr 1, somewhere after 78Mb) , but I couldn't identify any... (I searched for Alternate alles that would not match the following extended regexpr '[ATGC,]+')

I would be grateful for any help you can provide. Thanks.

Hey,

currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested to raise sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was just expecting to see more SNPs being called. However, I came across a list of SNPs, which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?

This is an extract from the vcf file containing the SNP that was previously called:

4 106156187 rs17253672 C T 490.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.341;ClippingRankSum=1.022;DB;DP=161;FS=4.193;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.231;QD=3.05;ReadPosRankSum=1.347 GT:AD:DP:GQ:PL 0/1:10,12:22:99:519,0,262

I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!

Dear GATK team,

Could you please help me how to explain genotyping in my vcf file. I have Illumina data and vcf caller was GATK. My variant frequency (Alt variant freq) is 99.7%. DP = 4622 (AD = 16, 4606) - so I would expect that this sample is alternate homozygous. But when I check PL value, which is - PL = 1655,0,323 - after calculating my likelihood -

REF= G ALT= A

```
P(D|GG) = 10 ^ -165.5 = small
P(D|AG) = 10 ^ 0 = 1
P(D|AA) = 10 ^ 32.3 = small
```

we can see it is heterozygous. Can anybody help me how to interpret my result? How it is possible that likelihoods show me heterozygous and coverage and VF show me homozygous?

Here is part of my vcf file:

`chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99`

Thank you for any explanation.

Paul.

Created 2013-07-24 10:00:06 | Updated | Tags: unifiedgenotyper genotyping

For the UnifiedGenotyper, I was wondering if you knew of any data describing the relationship between the number of samples which are simultaneously analysed, and the accuracy of the genotype calling, say for a SNP with a MAF of 0.1, and a coverage of 30X. Presumably this plateaus after a certain number of samples.