I am currently using GATK (GenomeAnalysisTK-3.3-0) to assess the level of heterozygosity in individual lizards (diploid). There is no known set SNPs or INDELs, so I have been using the bootstrapping protocol recommended in the best practices. Based on the recalibration plots, it appears that the before and after quality scores have converged, however the total number of filtered SNPs continues to change for each iteration of recalibration.
Should we expect the number of filtered SNPs to stop changing as we continue to bootstrap? Or should I be concerned about my dataset...
Some basic information: We are using the unified genotyper for variant calling. We have now completed 10 rounds of recalibration. We are counting the number of SNPs that pass filtering
If I count the number of SNPs in unix: for file in *.recode.vcf; do echo $file; cat $file | grep -v "#" | wc -l; done
all_SNPs_4.recode.vcf 922016 all_SNPs_5.recode.vcf 929913 all_SNPs_6.recode.vcf 923369 all_SNPs_7.recode.vcf 922344 all_SNPs_8.recode.vcf 923250 all_SNPs_9.recode.vcf 924366
I can provide further information if it would be helpful.
Thanks, Duncan Tormey firstname.lastname@example.org
Hello, I would like to call variants on non human genome (S.Cerevisiae). I understood that I should perform These steps:
my questions are: How should I perform the filtration of the variants in 3? do you have any recommendations? I would like to find variants only for one sample. should I use more samples for creating the variant data base, or one sample can be enough (I have 7 samples of that project, and I can use them)?
I've effectively got a sample of 200 individuals, that share one father but each have a different mother. I'm wondering how to best change the default parameters of Unified Genotyper to call variants without bias caused by assumptions of the algorithm.
Hey GATK team,
How about having a prominent category for queries relating to non-human data? There are a sizable number of us who work with birds, bees, dogs, fungi..., and it would help if we can directly go through a link rather than sift through myriad of answers that is not of much relevance to us.!()
Hello there! Thanks as always for the lovely tools, I continue to live in them.
Methods Thus Far
We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).
As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.
My semi-educated guess..
The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?
I'm trying to run GATK with the rice genome and I'm having trouble finding a known variant list of rice? Does anyone have a link or a more general list of known variant resources for other reference genomes?
Specifically, I found two rice genomes in dbSNP, but I can't find any documentation about which is which, and once I'm in the FTP site, which file I should be using?
Maybe take all the VCF files for each chromosome and cat them into one big file? Clearly, I'm a little clueless here :)
This is two separate questions:
Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?
Many thanks and apologies if I've missed anything in the instructions.
Hello, I have a new sequenced genome with some samples for this specie, I would like to follow the best practices but I don't have a dbsnp or something similar, but could I use the variants from the samples as a dbsnp? for example get the variants that coincide in all my samples and use it as a dbsnp?
I was calling SNP from Mouse samples using GATK and was in the step of "Variant quality score recalibration". The VariantRecalibrator walker asked me to provide training sets for mouse SNPs. The only SNP data (for the USCS mm9 assembly) I can find right now is the dbSNP. So I tried the run VariantRecalibrator like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Refseq.fa -input snps.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 snp128.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode BOTH -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R
However, the program asked for more:
I was wondering if I can change the parameters by setting both the training/truth to true:
or should I ignore the "-resource" option at the cost of being less accurate?
Any suggestions would be greatly appreciated.
Hi all, I am working with GATK on illumina data that was created from yeasts (SK1 strain). Since it is sequencing of a colony and not the exact same organism, I am using a filter that I wrote on the ratio of the reads that support the alternative allele. Is anyone else using something similar? Is there a build-in filter like that?
I am fairly new to GATK, but am trying to call SNPs in two bacterial strains against a single reference. In one strain the SNP is called, but not the other... looking at the alignment in IGV and also all sites (-out_mode EMIT_ALL_SITES) I can't understand why the SNP was not called in the second strain.
For the first strain, for which GATK calls the SNP NC_011770 9650 . C T 645.75 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;QD=15.02;ReadPosRankSum=2.270;SB=-255.73 GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0
For the second strain, for which GATK does NOT call the SNP: NC_011770 9650 . C T 942.90 PASS AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0
UnifiedGenotyper was called with these options:
-stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 100 -out_mode EMIT_ALL_SITES
Does anyone know why GATK does not call a SNP in the second strain?
Thanks for any help