Hello there! Thanks as always for the lovely tools, I continue to live in them.
Methods Thus Far
We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).
As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.
My semi-educated guess..
The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?
I'm trying to run GATK with the rice genome and I'm having trouble finding a known variant list of rice? Does anyone have a link or a more general list of known variant resources for other reference genomes?
Specifically, I found two rice genomes in dbSNP, but I can't find any documentation about which is which, and once I'm in the FTP site, which file I should be using?
Maybe take all the VCF files for each chromosome and cat them into one big file? Clearly, I'm a little clueless here :)
This is two separate questions:
Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?
I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only.
Many thanks and apologies if I've missed anything in the instructions.
Hello, I have a new sequenced genome with some samples for this specie, I would like to follow the best practices but I don't have a dbsnp or something similar, but could I use the variants from the samples as a dbsnp? for example get the variants that coincide in all my samples and use it as a dbsnp?
I was calling SNP from Mouse samples using GATK and was in the step of "Variant quality score recalibration". The VariantRecalibrator walker asked me to provide training sets for mouse SNPs. The only SNP data (for the USCS mm9 assembly) I can find right now is the dbSNP. So I tried the run VariantRecalibrator like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Refseq.fa -input snps.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 snp128.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode BOTH -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R
However, the program asked for more:
I was wondering if I can change the parameters by setting both the training/truth to true:
or should I ignore the "-resource" option at the cost of being less accurate?
Any suggestions would be greatly appreciated.
Hi all, I am working with GATK on illumina data that was created from yeasts (SK1 strain). Since it is sequencing of a colony and not the exact same organism, I am using a filter that I wrote on the ratio of the reads that support the alternative allele. Is anyone else using something similar? Is there a build-in filter like that?
I am fairly new to GATK, but am trying to call SNPs in two bacterial strains against a single reference. In one strain the SNP is called, but not the other... looking at the alignment in IGV and also all sites (-out_mode EMIT_ALL_SITES) I can't understand why the SNP was not called in the second strain.
For the first strain, for which GATK calls the SNP NC_011770 9650 . C T 645.75 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;QD=15.02;ReadPosRankSum=2.270;SB=-255.73 GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0
For the second strain, for which GATK does NOT call the SNP: NC_011770 9650 . C T 942.90 PASS AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0
UnifiedGenotyper was called with these options:
-stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 100 -out_mode EMIT_ALL_SITES
Does anyone know why GATK does not call a SNP in the second strain?
Thanks for any help