Hi GATK team,
I was hoping I could get some insight on determining rate of heterozygosity from a gvcf file. We have three diploid lizard samples. Each was run through our GATK pipeline using HC in GVCF mode with
-ERC GVCF followed by joint genotyping using all three samples.
I want to determine the rate of heterozygosity in each lizard by counting the number of heterozygous sites and dividing by the number of callable sites (i.e not './.') for every position in the genome (whether or not it is a variant).
However after reading a response to a question on the forum http://gatkforums.broadinstitute.org/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf, "Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final"
I am note sure this is the correct way to count heterozygous sites.
From the HC gvcf file, could I extract the number of callable sites from the GCVFBlocks and variant entries in the file to get the total number of callable sites (excluding './.' entries)? Then count the number of heterozygouse genotypes from the joint genotyping gvcf output?
java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ -ERC GVCF \ -R reference.fa \ -I sample001.bam \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -mbq 17 \ -o sample001.rawVAR.vcf)
java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ --includeNonVariantSites \ -ploidy 2 \ -R reference.fa \ --variant sample8450.rawVAR.vcf \ --variant sample003.rawVAR.vcf \ --variant sample001.rawVAR.vcf \ -o jg_sample001_sample003_sample8450.vcf
We are using GATK version 3.3.0
Any suggestions are appreciated! Thank you for your time.
I am currently using GATK (GenomeAnalysisTK-3.3-0) to assess the level of heterozygosity in individual lizards (diploid). There is no known set SNPs or INDELs, so I have been using the bootstrapping protocol recommended in the best practices. Based on the recalibration plots, it appears that the before and after quality scores have converged, however the total number of filtered SNPs continues to change for each iteration of recalibration.
Should we expect the number of filtered SNPs to stop changing as we continue to bootstrap? Or should I be concerned about my dataset...
Some basic information: We are using the unified genotyper for variant calling. We have now completed 10 rounds of recalibration. We are counting the number of SNPs that pass filtering
If I count the number of SNPs in unix: for file in *.recode.vcf; do echo $file; cat $file | grep -v "#" | wc -l; done
all_SNPs_4.recode.vcf 922016 all_SNPs_5.recode.vcf 929913 all_SNPs_6.recode.vcf 923369 all_SNPs_7.recode.vcf 922344 all_SNPs_8.recode.vcf 923250 all_SNPs_9.recode.vcf 924366
I can provide further information if it would be helpful.
Thanks, Duncan Tormey firstname.lastname@example.org
Hi folks I had a somewhat strange question to ask people and get some opinion and see if this makes sense. I've followed the bestpractice guide and now at a stage where I'm looking at my SNP vcf file any interpret the data. I've sequenced a non-model Drosophila specifically I've pooled a couple of progeny from a single line and sequenced the genome. I think this pooling is biting me back since the het sites I'm getting back are not exactly following the traditional definiton of hets and homozygous variation (ie. homozygosity (0%/100%) or heterozygosity (50%)) For example the het site found in the line below follows the traditional definition of heterozygosity (50%ref and 50%alt):
genome 1428318 . G A 392.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=0.207;ClippingRankSum=-8.860e-01;DP=259;FS=3.192;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.697;QD=12.27;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:.:32:64:0,64,1395 0/0:.:43:99:0,120,1800 0/0:.:23:62:0,62,990 0/1:16,16:32:99:426,0,422 0/0:.:28:67:0,67,1080 0/0:.:56:99:0,99,1800 0/0:.:45:99:0,120,1800
But something like this site doesn't look like the traditional het site:
genome 1430207 . C T 608.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=1.57;ClippingRankSum=-2.430e-01;DP=322;FS=1.962;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.172;QD=5.85;ReadPosRankSum=0.799 GT:AD:DP:GQ:PL 0/0:.:21:60:0,60,900 0/1:81,23:104:99:642,0,2872 0/0:.:40:62:0,62,1485 0/0:.:28:66:0,66,990 0/0:.:45:71:0,71,1620 0/0:.:24:64:0,64,990 0/0:.:60:92:0,92,1800
So from here is there any way to measure this difference? One thing I was thinking was for all het sites I was wondering if taking the ratio of the numbers reported for the PL field would be a statistic to look at the degree of heterozygosity per site. Specifically the ratio of PL for 0/0 and 1/1 genotype would be ~1 if there are 50% ref and 50% alt reads whereas a deviation from 1 would indicate otherwise. I was planning to plot this out and see if the ratios cluster around 1 or if it looks different. Would this be a wrong way of using the PL field and are there any other ways to see the so called "degree" of heterozygosit per site? I'm sorry if I sound too confusing.
If I specify both -inputPrior and -hets, which one will UnifiedGenotyper use to determine the prior? Is --heterozygosity used for anything other than prior specification?
I am calling variants in exome resequencing of a sample of inbred barley lines using the HaplotypeCaller. Initially, I set the heterozygosity parameter to 0.008, which is the average coding sequence diversity (Tajima's pi) in cultivated barley. In the 60MB capture, I call about 500,000 single nucleotide variants, which seems appropriate given our previous information.
Our collaborators asked us to re-run the variant calling with the default parameters, to see what effect changing the heterozygosity parameter has. When I do, the HaplotypeCaller calls almost one million variants, over twice what we would expect.
Why would this be? Intuitively, I would expect that decreasing the heterozygosity parameter by almost an order of magnitude would under-call variants. Instead, it has the opposite effect.
I am using GATK 2.8-1. Thank you!
Dear developers, I have looked into the forum for similar questions but I couldn't find any. I have several cases in which I get homozygous calls in positions with ~50% of reads calling the mutation (or less), please find here an example of a position validated by Sanger (as het) in which I have a high coverage (~400 reads in total) here is the results with UnifiedGenotyper:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT L1 chr4 998101 . C T 7296.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=16.344;DP=411;Dels=0.00;FS=0.000;HaplotypeScore=11.8258;MLEAC=2;MLEAF=1.00;MQ=59.94;MQ0=0;MQRankSum=-1.436;QD=17.75;ReadPosRankSum=-0.062 GT:AD:DP:GQ:PL 1/1:203,208:411:99:7325,581,0
can you help me in this case? I am really puzzled.
I have run it with v2.2, 2.4 and 2.5 and I always had the same genotype call (the excerpt here is from the 2.5, I have downloaded it just to check it was not caused by a bug already fixed). It's not a downsampling issue since I have high coverage samples (HaloPlex) and used higher dcov than the default.
I have a VCF containing 7.4m SNPs over 70 individuals from an F2 intercross, called by the UnifiedGenotyper v2.3.6. I am trying to set appropriate thresholds for filtering these SNPs. The attached plots summarise the individual calls from this data set, with depth on the x-axis, genotype quality on the y-axis and frequency of particular DP+GQ combinations shown in greyscale. The first plot shows 0/1 (heterozygote) calls, the second shows 0/0 (homozyote) calls (the 1/1 plot looks similar to the 0/0 plot).
The homozygote plot shows a clear relationship between minimum depth and maximum GQ; it is impossible to get high GQs at low depth. However, this is not the case for heterozygotes. This makes intuitive sense to me - at low depth, one cannot be sure that a call really is homozygote; perhaps the other allele simply hasn't been sequenced. But we can have more confidence in a low depth heterozygote, as both alleles have been seen.
However, I am wondering what your recommendations for best practice are here; do you recommend using the same GQ thresholds for homozygote and heterozygote calls, or different thresholds? If the same thresholds, it seems like there will be a bias at low coverage; many (quite possibly real) homozygote calls will be excluded, which will make it appear that there is an excess of heterozygosity in low coverage individuals.
Also, there seems to be a periodicity in the homozygote (but not the heterozygote) GQ values; GQ values divisible by three have a different distribution to other GQ values. I assume this doesn't affect the results too much (after all, the scale is fairly arbitrary in the first place) but I'd be interested to know what causes this, if it is known.
Thanks for your help,
I'm sequencing the genome of an organism which is a cross between the reference line (with no SNPs) and an individual from an outbred population (with many SNPs). Therefore all of the SNPs in my target organism will be heterozygous. So far I have sequenced three individuals which are crosses and one individual from our reference line.
I understand that the UnifiedGenotyper uses population genetic principles to ascertain genotype but I can't find more information about how this is performed. Thus, I am primarily worried that heterozygotes with strongly asymmetric allele counts in the reads will be called as homozygotes in order to fit in with, say Hard-Wienberg equilibrium.
Is there any chance you could enlighten me on this ? (or direct me to more detailed information on UG mechanism and settings).
Just to let you know the background, my study organism is Drosophila melanogaster. The whole genome of 164Mb is paired-end sequenced on an Illumina. I have so far sequenced one individual from our in-house reference line, and three individuals which are crosses of the reference line with a diverse, out-bred population. Average coverage is 30X. The 'crosses' are hemiclones in which recombination between the parental chromosomes is suppressed. I plan on sequencing 200 hemiclone individuals in which one haplotype will be shared between them (the reference gene) and the other haplotype will be diverse and unique to each line. As expected, I have identified a limited number of mutations in our in-house laboratory reference line compared to that of the assembly.
Any advice on how to best call genotypes in this unorthodox sample would be most appreciated.
I've been experiencing some apparent errors with HaplotypeCaller that I think could be related to how it chooses candidate haplotypes when performing multi-sample calling. Please see the example files I've uploaded to the server (cooketho_20130103.tar.gz). For instance if you look at position 3511 in sample 2, there are 14 non-reference reads and 0 reference reads. When HaplotypeCaller is run with just this sample, it calls this locus homozygous non-reference, which seems to me to be the correct behavior. But when run with all 14 samples, it doesn't call a SNP at this locus. Repeating the run in debug mode shows that the (immediate) cause is that there were 11 candidate haplotypes found, and not a single one of them had the non-reference allele at position 3511. Why?
I came across an earlier post that suggested in some cases increasing the
--minPruning value can be of use, but I tried this to no avail.
My organism is a plant, and is is considerably more heterozygous than human, but changing the
--heterozygosity value did not appear to help either. Double check me on this if you like.
Can you please suggest a fix, or perhaps release some documentation on how HaplotypeCaller selects candidate haplotypes?
P.S. Any idea of when the source will be released to the public, or when a more comprehensive manual will be released? Would be very helpful for figuring out what is going on in cases like this.