Hi - I have a question regarding Individual vs Group SNP calling. I am using Unified Genotyper, call the SNPs groupwise, are calling SNPs and indels separately, data is high coverage (>20). Since my individuals are mostly from one population (human) with only few comparative samples I wondered how much the group calling affect the SNP calling in the few comparative samples. Would there not be a bias towards SNPs in the dominant group? In such a case would it be better to do individual sample SNP calling to give unbiased results?
I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.
However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.
I think I recall another post having a similar issue due to multithreaded use of select variants
APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!
Hi, using unified genotyper with multiple samples (different lines of Medicago truncatula showing polymorphism), I would like to get a file giving me, for each sample, the alleles at each SNP position (SNP positions identified for all samples). Is it possible? I didn't find any option for this, and it would be dificult to make SNP calling for each sample and then create a new file with all of them, because SNPs positions and number are different according to the samples. Thanks in advance,
I am working with de novo assembly and I aligned reads to the scaffolds and called SNPs across the samples. But the result I received after Calling for all confident sites (using Unified Genotyper) is quite unusual to me.
This is what I get:
INFO 22:57:01,315 UnifiedGenotyper - Visited bases 1073801052
INFO 22:57:01,339 UnifiedGenotyper - Callable bases 977317465
INFO 22:57:01,349 UnifiedGenotyper - Confidently called bases 575018059
INFO 22:57:01,357 UnifiedGenotyper - % callable bases of all loci 91.015
INFO 22:57:01,366 UnifiedGenotyper - % confidently called bases of all loci 53.550
INFO 22:57:01,374 UnifiedGenotyper - % confidently called bases of callable loci 58.836
INFO 22:57:01,382 UnifiedGenotyper - Actual calls made 698376287
My question is how the number of Actual calls made is greater than the number of Confidently called bases as I always expect Confidently called bases would be greater or equal to Actual calls made but not less.
I would appreciate your help. Many Thanks. sarkar
I am using GATK through the Galaxy main server to analyze variations from whole-genome re-sequencing of various samples of non-model species (nematodes worms). I would like to know whether it is possible to have with Galaxy's GATK tools a kind of pileup (base per base or intervall, like .bed) of genome indicating specifically which base where callable or not by Unified Genotyper (UG), such as "CallableLoci". The log & metrics files generated by UG in Galaxy give the general statistics of callable loci, but there is no such a file giving a detailed information of the eligibility of each base.
In the same kind of idea, I would like to get a per-locus-depth of coverage (which can partially help answering my previous question, although it does not take into account all the filters used by UG such as base quality, mapping quality, etc.). This tool is available on Galaxy. However, I am performing 3 rounds of BQSR to get my final vcf file. Shall I calculate the depth of coverage using the first BAM file before BQSR or the last recalibrated BAM file obtained in the 3rd round of BQSR? I don't think BQSR alter the coverage score, so I would say this shouldn't matter. Am I right?
Thanks in advance for help and advices, Fabrice
I am looking at the distribution of quality scores (QUAL column of VCF) emitted by the Unified Genotyper v1.4 (old version, sorry :)) for SNPs called on 769 samples with 12x coverage HiSeq data. The median quality score is at ~Q750 and the mean at ~Q28000. I was just wondering if you could comment on how well calibrated these quality scores are and whether/how the number of samples and/or coverage would affect them?
Thanks in advance, Laurent
Hi, I'm working with the UnifiedGenotyper walker and I have detected strange values for the QUAL field of some VCF entries in the output files.
Sometimes in the VCF output file, the QUAL value for different vcf entries it is repited, for example, the QUAL values 32729.73 or 2147483609.73 usually appear in the output and not only in my files, because when I have searched on the GATK forum, this value appears in other users posted vcf files related to other questions.
I have tested it with several GATK versions, and in the latests versions these QUAL numbers are extremely high and I have also detected that the value doesn't correspond with the relationship QUAL~QD*DP.
Another strange thing is that for other QUAL values in the VCF file, it is very common that decimal part begins with a seven. i.e : 32729.73
Have you detected this? Is some kind of bug?
I look forward to your response.
I copy some VCF entries with different GATK versions:
Last version (2.7-2):
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.483;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.88;MQ0=0;MQRankSum=13.620;QD=33.72;RPA=11,12;RU=T;ReadPosRankSum=-12.065;STR GT:AD:DP:GQ:PL 0/1:1386,1037:2453:99:9301,0,13130 chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.991;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.93;MQ0=0;MQRankSum=26.910;QD=28.67;RPA=7,8;RU=T;ReadPosRankSum=-2.595;STR GT:AD:DP:GQ:PL 0/1:1306,1142:2469:99:14116,0,16944
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.106;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.81;MQ0=0;MQRankSum=14.984;QD=29.49;RPA=11,12;RU=T;ReadPosRankSum=-9.803;STR GT:AD:DP:GQ:PL 0/1:1261,1038:2453:99:7901,0,13152 chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.976;DP=998;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.74;MQ0=0;MQRankSum=25.865;QD=31.47;RPA=7,8;RU=T;ReadPosRankSum=-0.572;STR GT:AD:DP:GQ:PL 0/1:1184,1142:2469:99:13365,0,16796
chr13 32907535 . C CT 32729.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.023;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.75;MQ0=0;MQRankSum=-3.054;QD=32.73;RPA=11,12;RU=T;ReadPosRankSum=3.137;STR GT:AD:DP:GQ:PL 0/1:0,29:2453:99:7901,0,13152 chr13 32907589 . G GT 32729.73 . AC=1;AF=0.500;AN=2;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.71;MQ0=0;QD=32.76;RPA=7,8;RU=T;STR GT:AD:DP:GQ:PL 0/1:0,0:2469:99:13365,0,16796
I've got some question about some of the GATK tools and practices (and hope its okay to post them into a single thread)
Is there any additional background information about the UnifiedGenotyper available , especially in case of multi sample calling ? How this works in a bit more detailed way ? So far I could only find the slides from the last GATK Workshop. But if I remember this correctly, during the presentation it was mentioned one could ask if more information is required (unfortunately I wasn't at the workshop ;))
What's the definition of the Base Score Accuracy (Base Quality Score Recalibration Plots) ? Am I correct that this specifies how well the observed quality scores match the expected (empirical) quality scores ? I think I read it somewhere but couldn't find it any more.
I've read that the way to validate(check what and how much was done) the Realign around Indels step, is to count/search for the OC Tags in the alignment file. Is there any fast way to do so ? Or do I have to convert the BAM into a SAM and count by running through lines of the alignment ?
Fortunately there exist the "Recommended sets of known sites per tool" which I used so far. But is there any explanation why those sets are recommended ?
Tanks a lot !
I am using GATK unified genotyper for SNP and indel calling. My code is as follows:
$javaPath/java -Xmx4g -jar $gatkPath/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R $humanRefSequence \ -I "$sampleID".recal.bam \ -B:dbsnp,vcf $humanDbsnpFile \ -nt 8 -glm BOTH \ -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 250 \ -l INFO \ -A AlleleBalance -A DepthOfCoverage -A FisherStrand \ -log "$sampleID".GATKvariants.log -o "$sampleID".GATKvariants.raw.vcf
However, after running few minutes it's showing the following error:
I looked around and couldn't find a proper solution for it. I got stuck with for quite awhile. Can anyone please tell me what is the main reason for this error and how I can overcome it?
Many thanks in advance.
Hello, I am using the latest GATK Unified Genotyper (UG) software for my BWA aligned reads (paired-end). 1. BWA: default parameters 2. markDuplicates (PICARD) and realignment (GATK) 3. UnifiedGenotyper default values except: stand_call_conf 30.0, stand_emit_conf 10.0 When I use the ReducedBam with UG, I get 2,247,468 SNPs When I use the Bam without ReducedReads UG gives me 2,245,966 SNPs
I used BEDTOOLS to compare both files: 2,229,901 shared SNPs 17,567 only identified with the ReducedReads Bam 16,065 only identified with the non ReducedReads Bam
Do you have an idea, what happend here? Many thanks in advance
I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?
I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.
Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.