This is two separate questions:
Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?
I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only.
Many thanks and apologies if I've missed anything in the instructions.
Hello, I have a new sequenced genome with some samples for this specie, I would like to follow the best practices but I don't have a dbsnp or something similar, but could I use the variants from the samples as a dbsnp? for example get the variants that coincide in all my samples and use it as a dbsnp?
Hi all, I am working with GATK on illumina data that was created from yeasts (SK1 strain). Since it is sequencing of a colony and not the exact same organism, I am using a filter that I wrote on the ratio of the reads that support the alternative allele. Is anyone else using something similar? Is there a build-in filter like that?
I was calling SNP from Mouse samples using GATK and was in the step of "Variant quality score recalibration". The VariantRecalibrator walker asked me to provide training sets for mouse SNPs. The only SNP data (for the USCS mm9 assembly) I can find right now is the dbSNP. So I tried the run VariantRecalibrator like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Refseq.fa -input snps.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 snp128.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode BOTH -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R
However, the program asked for more:
I was wondering if I can change the parameters by setting both the training/truth to true:
or should I ignore the "-resource" option at the cost of being less accurate?
Any suggestions would be greatly appreciated.
I am fairly new to GATK, but am trying to call SNPs in two bacterial strains against a single reference. In one strain the SNP is called, but not the other... looking at the alignment in IGV and also all sites (-out_mode EMIT_ALL_SITES) I can't understand why the SNP was not called in the second strain.
For the first strain, for which GATK calls the SNP NC_011770 9650 . C T 645.75 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-2.149;DP=43;Dels=0.05;FS=4.191;HRun=2;HaplotypeScore=5.6633;MQ=64.95;MQ0=0;MQRankSum=0.878;QD=15.02;ReadPosRankSum=2.270;SB=-255.73 GT:AD:DP:GQ:PL 1/1:2,39:41:46.50:679,46,0
For the second strain, for which GATK does NOT call the SNP: NC_011770 9650 . C T 942.90 PASS AC=2;AF=1.00;AN=2;DP=53;Dels=0.06;FS=0.000;HRun=2;HaplotypeScore=23.7546;MQ=53.63;MQ0=0;QD=17.79;SB=-393.80 GT:AD:DP:GQ:PL 1/1:0,47:50:99:976,105,0
UnifiedGenotyper was called with these options:
-stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 100 -out_mode EMIT_ALL_SITES
Does anyone know why GATK does not call a SNP in the second strain?
Thanks for any help
I'm trying to run GATK with the rice genome and I'm having trouble finding a known variant list of rice? Does anyone have a link or a more general list of known variant resources for other reference genomes?
Specifically, I found two rice genomes in dbSNP, but I can't find any documentation about which is which, and once I'm in the FTP site, which file I should be using?
Maybe take all the VCF files for each chromosome and cat them into one big file? Clearly, I'm a little clueless here :)