I have used GATK for human data, and now I need to call variants in bacteria. For human, known variants are fed into the base quality recalibration step. However, I do not have any known variants for bacteria. Can I simply skip base quality recalibration? I tried skipping it, but got an exceptionally large number of SNPs. Are there any strict requirements for bacterial variant calling?
I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know that the diploid genotype calls and likelihoods will not be valid since my data is not diploid, but I want to use the VCF output to sum up base frequencies at detected variant loci.
I mapped 100+ samples (each being ~2 Illumina GA2 lanes of data that after host filtering usually contain about 20-40 million reads per sample) against a database of 671 bacterial reference sequences (and each reference can be in multiple parts, so I probably have 10s of thousands of sequence records in my ref db, spanning the 671 reference genomes...around 2.2Gb in total size). I am then feeding the resulting 100+ bam files to the UnifiedGenotyper.
After some initial mistakes on my part (yes I have entered the future and am using GATK 2.2-5 now :) ) I've now started a run in proper fashion, but after a couple of hours it's dying with the message that the Java application has run out of memory:
I had set -Xmx60g for that failed run, so now I'm wondering if it's possible to estimate how much memory would be needed for this job I'm trying to run. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here? Or the number of samples?
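For the plan of summing base frequencies from the VCF output, here is a minimal Python sketch of what I have in mind. It assumes a tab-delimited multi-sample VCF line whose FORMAT column includes AD (per-allele read depths, reference first). The function names and the example line are made up for illustration and are not real data from my run.

```python
def sum_allele_depths(vcf_line):
    """Sum the per-sample AD values on one VCF data line.

    Returns a dict mapping each allele (REF first, then ALTs) to its
    total read depth across samples, or None if the line has no AD field.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")
    format_keys = fields[8].split(":")
    if "AD" not in format_keys:
        return None
    ad_index = format_keys.index("AD")

    totals = [0] * len(alleles)
    for sample in fields[9:]:
        values = sample.split(":")
        # Skip samples with no call (e.g. "./.") or a truncated entry.
        if ad_index >= len(values):
            continue
        depths = values[ad_index].split(",")
        if len(depths) != len(alleles) or "." in depths:
            continue
        for i, depth in enumerate(depths):
            totals[i] += int(depth)
    return dict(zip(alleles, totals))


def allele_frequencies(counts):
    """Convert summed depths into fractions of the total depth."""
    total = sum(counts.values())
    if total == 0:
        return {allele: 0.0 for allele in counts}
    return {allele: depth / total for allele, depth in counts.items()}


# Hypothetical three-sample line; the second sample has no call.
example_line = "\t".join([
    "contig_1", "100", ".", "C", "T", "50.0", ".", "DP=30",
    "GT:AD:DP", "0/1:8,2:10", "./.", "1/1:1,19:20",
])

counts = sum_allele_depths(example_line)
freqs = allele_frequencies(counts)
for allele in counts:
    print(allele, counts[allele], round(freqs[allele], 3))
```

This ignores the genotype calls entirely and only uses the read counts, which is the part I expect to still be meaningful for non-diploid data.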
I am fairly new to GATK, but am trying to call SNPs in two bacterial strains against a single reference. In one strain the SNP is called, but not in the other. Looking at the alignment in IGV and at the output for all sites (-out_mode EMIT_ALL_SITES), I can't understand why the SNP was not called in the second strain.
For the first strain, for which GATK calls the SNP:
NC_011770 9650 . C T 645.75 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;QD=15.02;ReadPosRankSum=2.270;SB=-255.73 GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0
For the second strain, for which GATK does NOT call the SNP:
NC_011770 9650 . C T 942.90 PASS AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0
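To make the two records easier to compare side by side, the per-sample fields can be split out by their FORMAT keys. This is only a rough Python sketch: the helper names are hypothetical, and it assumes single-sample, biallelic records with whitespace-separated columns as pasted here (an actual VCF file is tab-delimited). It just lays out GT and AD for reading; it does not explain the caller's behaviour.

```python
def parse_sample(record):
    """Map FORMAT keys to the values of the single sample column."""
    fields = record.split()
    keys = fields[8].split(":")
    values = fields[9].split(":")
    return dict(zip(keys, values))


def alt_fraction(sample):
    """Fraction of AD reads supporting the first ALT allele."""
    depths = [int(d) for d in sample["AD"].split(",")]
    total = sum(depths)
    return depths[1] / total if total else 0.0


# The two records quoted above, copied as pasted.
strain1_record = (
    "NC_011770 9650 . C T 645.75 PASS "
    "AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;"
    "HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;"
    "QD=15.02;ReadPosRankSum=2.270;SB=-255.73 "
    "GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0"
)
strain2_record = (
    "NC_011770 9650 . C T 942.90 PASS "
    "AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;"
    "HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 "
    "GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0"
)

strain1 = parse_sample(strain1_record)
strain2 = parse_sample(strain2_record)
for name, sample in (("strain 1", strain1), ("strain 2", strain2)):
    print(name, sample["GT"], sample["AD"], round(alt_fraction(sample), 3))
```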
UnifiedGenotyper was called with these options:
-stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 100 -out_mode EMIT_ALL_SITES
Does anyone know why GATK does not call a SNP in the second strain?
Thanks for any help