The formula and the description given for the OND annotation seem to be contradictory (see: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AlleleBalance.php). The formula implies that a true diploid variant would have a non-allele value of zero and therefore have an OND=1. However, the description "reads that support something other than the genotyped alleles (called "non-alleles") will be counted in the OND tag, which represents the overall fraction of data that diverges from the diploid hypothesis." suggests that a higher fraction is more divergent from diploid. Can you please clarify (e.g., confirm the true formula should be 1-alleles/(alleles+non-alleles) and that an ideal diploid variant would have an OND of zero)?
Additionally, we have noticed a lot of missing OND values (not multiallelic or indels). Can you explain when/why these may be missing?
Thanks so much!
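To make the apparent contradiction concrete, here is a minimal sketch comparing the two readings of OND discussed above. It is not GATK code: the function names are hypothetical, the "documented" formula is only the one this post infers from the documentation, and returning None when there are no reads is an assumption made for the sketch, not a statement about why GATK omits the value.

```python
def ond_as_read_from_docs(allele_reads, non_allele_reads):
    """OND as this post reads the documented formula: alleles / (alleles + non-alleles)."""
    total = allele_reads + non_allele_reads
    if total == 0:
        return None  # assumption for this sketch: undefined without reads
    return allele_reads / total


def ond_as_proposed(allele_reads, non_allele_reads):
    """OND as proposed in the post: 1 - alleles / (alleles + non-alleles).

    Algebraically this equals non-alleles / (alleles + non-alleles),
    which is the form computed here.
    """
    total = allele_reads + non_allele_reads
    if total == 0:
        return None  # assumption for this sketch: undefined without reads
    return non_allele_reads / total


# An ideal diploid site: every read supports one of the genotyped alleles.
print(ond_as_read_from_docs(30, 0), ond_as_proposed(30, 0))
# A site where 3 of 30 reads support something else.
print(ond_as_read_from_docs(27, 3), ond_as_proposed(27, 3))
```

Under the first reading an ideal diploid site scores 1; under the proposed reading it scores 0, which matches the description of OND as the fraction of data that diverges from the diploid hypothesis.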
For some calls I don't see allele depth (AD), for some I don't see depth (DP), and for some I see neither. Are there scenarios where GATK cannot annotate these?
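One way to pin down which calls are affected is to scan the VCF and report, per sample, which of AD and DP are absent or set to ".". The sketch below is only an illustration: the function name is hypothetical, the example record is made up, and it assumes standard tab-separated VCF data lines with a FORMAT column.

```python
def missing_depth_fields(vcf_line):
    """Return, for each sample in one VCF data line, the list of AD/DP keys
    that are absent from FORMAT or have only missing values ('.')."""
    cols = vcf_line.rstrip("\n").split("\t")
    format_keys = cols[8].split(":")
    report = []
    for sample in cols[9:]:
        values = dict(zip(format_keys, sample.split(":")))
        missing = []
        for key in ("AD", "DP"):
            value = values.get(key, ".")
            # Treat ".", "" and ".,." style entries as missing.
            if all(part in ("", ".") for part in value.split(",")):
                missing.append(key)
        report.append(missing)
    return report


# Made-up record: the first sample has both fields, the second has neither.
example = "chr1\t100\t.\tA\tT\t50\t.\t.\tGT:AD:DP\t0/1:11,2:13\t./.:.:."
print(missing_depth_fields(example))
```

Running this over the affected records shows whether the fields are missing from the FORMAT column altogether or only for particular samples.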
Using GATK 3.1-1, I seem to be unable to get the "AB" (AlleleBalance) annotation for the calls using the HaplotypeCaller -> GenotypeGVCFs pipeline, and I'm not sure how to get it. Our current pipeline (GATK 2.7-4, UnifiedGenotyper) requires this field to perform filtering, so this annotation is essential for us to upgrade to GATK 3.1.
My current pipeline of commands is as follows:
$ GenomeAnalysisTK-3.1-1 -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -I
$ GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V <GVCF file(s)> -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz
Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66).
The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller?
Thanks so much for your help, and if you need any further information, please let me know.
Version 3.1.1. Human normal samples.
I couldn't find the AlleleBalance and AlleleBalanceBySample tags in my VCF outputs. The tags are not present for even a single variant. I tried HaplotypeCaller with -all and directly with -A AlleleBalance or -A AlleleBalanceBySample. I also tried VariantAnnotator with -all, -A AlleleBalance, or -A AlleleBalanceBySample.
Any help will be appreciated.
I have an issue that I'm not sure how to solve. I'm using HaplotypeCaller with a three-member family, and for some calls the HOM/HET genotype doesn't match what we expected:
chr1 94487191 . A T 93.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=-0.540;ClippingRankSum=-0.374;DP=87;FS=4.946;MLEAC=1;MLEAF=0.167;MQ=61.11;MQ0=0;MQRankSum=-0.042;QD=7.16;ReadPosRankSum=-0.042;set=GATK GT:AD:GQ:PL 0/0:40,0:99:0,102,2405 0/0:33,0:90:0,90,2215 0/1:11,2:99:124,0,760
In the third sample, the call is 0/1 (HET), but the allele depths are 11 and 2. That means a site where only about 15% of the reads (2 of 13) support the alternate allele is called HET. Is there a way to change the allele balance threshold to 30%?
The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
Program Args: -T HaplotypeCaller -R ucsc.hg19.fasta -I S1.bam -I S2.bam -I S3.bam -o OUT.RAW.vcf --genotyping_mode DISCOVERY -stand_call_conf 30.0 -stand_emit_conf 10.0
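For reference, the allele balance in the record above can be recomputed from the AD field after calling. This is a minimal sketch of a post-hoc check, not a GATK option: the function name and the 0.30 cutoff variable are hypothetical, and the FORMAT string and sample fields are copied from the record in this post.

```python
def alt_fraction(format_keys, sample_field):
    """Fraction of reads supporting non-reference alleles, taken from AD."""
    fields = dict(zip(format_keys.split(":"), sample_field.split(":")))
    depths = [int(d) for d in fields["AD"].split(",")]
    total = sum(depths)
    if total == 0:
        return None  # no informative reads for this sample
    return sum(depths[1:]) / total


FORMAT = "GT:AD:GQ:PL"
samples = [
    "0/0:40,0:99:0,102,2405",
    "0/0:33,0:90:0,90,2215",
    "0/1:11,2:99:124,0,760",
]
MIN_HET_BALANCE = 0.30  # hypothetical cutoff, taken from the question above

for sample in samples:
    genotype = sample.split(":")[0]
    fraction = alt_fraction(FORMAT, sample)
    # Flag heterozygous calls whose alternate fraction falls below the cutoff.
    flagged = genotype == "0/1" and fraction is not None and fraction < MIN_HET_BALANCE
    print(genotype, fraction, "FLAG" if flagged else "ok")
```

For the third sample this gives 2 / (11 + 2), which falls below the 30% cutoff, so a filter of this kind would flag the call rather than change the genotype.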
I'm sequencing the genome of an organism which is a cross between the reference line (with no SNPs) and an individual from an outbred population (with many SNPs). Therefore all of the SNPs in my target organism will be heterozygous. So far I have sequenced three individuals which are crosses and one individual from our reference line.
I understand that the UnifiedGenotyper uses population genetic principles to ascertain genotypes, but I can't find more information about how this is performed. I am primarily worried that heterozygotes with strongly asymmetric allele counts in the reads will be called as homozygotes in order to fit, say, Hardy-Weinberg equilibrium.
Is there any chance you could enlighten me on this (or direct me to more detailed information on the UG mechanism and settings)?
Just to let you know the background, my study organism is Drosophila melanogaster. The whole genome of 164Mb is paired-end sequenced on an Illumina. I have so far sequenced one individual from our in-house reference line, and three individuals which are crosses of the reference line with a diverse, out-bred population. Average coverage is 30X. The 'crosses' are hemiclones in which recombination between the parental chromosomes is suppressed. I plan on sequencing 200 hemiclone individuals in which one haplotype will be shared between them (the reference gene) and the other haplotype will be diverse and unique to each line. As expected, I have identified a limited number of mutations in our in-house laboratory reference line compared to that of the assembly.
Any advice on how to best call genotypes in this unorthodox sample would be most appreciated.
We are using the unified genotyper to create a vcf file. In the vcf file we see the info tags;
but when we look at the exported annotations
we only see ABHom not ABHet.