Hello First of all thank you for provide us with such complete and friendly set of functions and information. I will briefly explain the context of my research, so you'll be able to understand my final question.
I am performing a SNV calling (among another strategies) of two very close cyanobacteria strains, which are the most abundant strains from my samples, detected through different methods. None of the methods could tell us, definitely, if one or another are really there, or if it is a third unknown strain, because there is some genes missing in one and another. The dataset is a metagenomic Illumina paired end reads. The variant calling of this dataset against one "highly covered" reference yields a "treasure map" to explore the population variability, exchange and adaptability. Among different methods, two species emerged as potential "chimera backbones" guiding the large populational diversity of this lake. Both have almost the same sample coverage, both were considered the same species for years (even now), but they present local rearrangements, inversions and gene loss, that make them phenotypically different and change completely the toxin production. Even so, this two strains share >99 nucleotide similarity. This is not a problem, since I know were this differences are, and, in fact, I want to see the flanking regions and remaining synteny changes arising from this.
My first problem is: both complete genomes were generated by WGS and are in multi-fasta format of >90 contigs. The contigs are not ordered and the contig names doesn't tell me nothing about synteny or even parallelism between the two genomes (besides they were sequenced together...). When looking just for one of the outputs on Tablet, it is ok to deal with it, since the gff3 file guides me with the features and (fortunately) both have anotation databases to run SnpEff. Could be better, yes, if anyone has some idea or exerience dealing with multi-contigs reference genome, I appreciate any advice. For example, for one of them, I got stucked in the SelectVariant step among the cyclic recalibration pipeline (I don´t have known sites for this two genomes). I got the error from this link: https://www.broadinstitute.org/gatk/guide/article?id=1328, I ran the script, but it keeps telling me that my dict is probably corrupted, it shows me the contigs positions and, in fact, is is unordered, but all the other steps worked fine, and the entire pipeline, including the cyclic recalibration step, worked fine with the another genome, which has the exact same fasta structure.
My most important question: I need to compare the two calls. In a ideal world, I would like to "align/map" one against another (the syntenic regions been parallel and the clusters or genes missing between them appearing like gaps. Doing this, I could finally compare the SNV's, looking for common and divergent variant regions. I really dont know how to proceed with this comparison, since the contigs are not ordered and even knowing what contig correpond to another, for some punctual functions, the contigs length is different. So, I need any advice to help me to reach this concordance and discordance between 2 genomes, instead of 2 samples. If not possible for the whole genome calling, at least for one pair of contigs. Off course I looked for this contig correpondence and order but the references are pretty unclear, I am considering to send an email to the authors to ask them too.
Thank you so much, sorry by the long question.
I have used GATK for human. Now i have a need to call variants from bacteria. In case of human, known variants is fed in Base quality recalibration step, however, i do not have any know variants for bacteria, can i simply skip the step of Base quality recalibration? I gave a try by skipping it, but i got exceptionally huge number of SNPs. Are there any strict requirements for bacteria variant call?
I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know that the diploid genotype calls & likelihoods will not be valid since my data is not diploid, but I want to use the vcf output so sum up base frequencies at detected variant loci.
I mapped 100+ samples (each being ~2 Illumina GA2 lanes of data that after host filtering usually contain about 20-40 million reads per sample) against a database of 671 bacterial reference sequences (and each reference can be in multiple parts, so I probably have 10s of thousands of sequence records in my ref db, spanning the 671 reference genomes...around 2.2Gb in total size). I am then feeding the resulting 100+ bam files to the UnifiedGenotyper.
After some initial mistakes on my part (yes I have entered the future and am using GATK 2.2-5 now :) ) I've now started a run in proper fashion, but after a couple hours its dying with the message that the java application has run out of memory:
I had set -Xmx60g for that failed run, so now I'm wondering if its possible to estimate how much memory would be needed for this job I'm trying to run. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here? Or the number of samples?
I am fairly new to GATK, but am trying to call SNPs in two bacterial strains against a single reference. In one strain the SNP is called, but not the other... looking at the alignment in IGV and also all sites (-out_mode EMIT_ALL_SITES) I can't understand why the SNP was not called in the second strain.
For the first strain, for which GATK calls the SNP NC_011770 9650 . C T 645.75 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;QD=15.02;ReadPosRankSum=2.270;SB=-255.73 GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0
For the second strain, for which GATK does NOT call the SNP: NC_011770 9650 . C T 942.90 PASS AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0
UnifiedGenotyper was called with these options:
-stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 100 -out_mode EMIT_ALL_SITES
Does anyone know why GATK does not call a SNP in the second strain?
Thanks for any help