The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.
GATK variant manipulation tools are able to recognize the following types of alleles:
Of our two variant callers, UnifiedGenotyper is the more limited, as it only calls SNPs and indels, and does so separately (even if you run in calling mode BOTH, the program performs separate calling operations internally). The HaplotypeCaller is more sophisticated and calls different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default. It is also capable of emitting MNPs and symbolic alleles, but the modes to do so are not enabled by default and they are not part of our recommended best practices for the tool.
The GATK currently does not handle SVs (structural variations) or CNVs (copy number variations), but there are some third-party software packages built on top of GATK that provide this functionality. See GenomeSTRiP for SVs and XHMM for CNVs.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihood calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants, including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This threshold can be set using the --max_deletion_fraction argument (see the UG's documentation page for the default value), but be aware that increasing it could affect the reliability of your results.
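As a sketch of how that would look on the command line (the value 0.2 is purely illustrative, and the file names are placeholders; check the documentation for the actual default before changing it):

```shell
# Hypothetical UnifiedGenotyper run raising the overlapping-deletion
# tolerance; 0.2 and all file names are placeholders, not recommendations.
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
  -R ref.fasta -I sample.bam -o out.vcf \
  --max_deletion_fraction 0.2
```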
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page for the default value). Note, however, that genotyping sites with many alternate alleles is both CPU- and memory-intensive, and the cost scales exponentially with the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
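For illustration only (the value and file names below are placeholders; as noted above, changing the default is rarely a good idea):

```shell
# Hypothetical invocation overriding the alternate-allele cap;
# the value 3 and the file names are placeholders.
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
  -R ref.fasta -I sample.bam -o out.vcf \
  --max_alternate_alleles 3
```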
SOLiD alignments tend to show reference bias, which can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site? If so, you are probably seeing false positives.
Hi! I have been working for some time on a single-end mRNAseq data set. It's a high-quality set with lots of biological replicates (200+).
My question is: how could I best contribute to the methodology used for SNP calling in mRNAseq? What do we need to test to improve this method?
So I have used the latest GATK best practices pipeline for variant detection on non-human organisms, but now I am trying to do it for human data. I downloaded the Broad bundle and I was able to run all of the steps up to and including ApplyRecalibration. However, now I am not exactly sure what to do. The VCF file that is generated contains these FILTER values:
I am not sure what these mean. Does the "VQSRTrancheSNP99.90to100.00" filter mean that that SNP falls below the specified truth sensitivity level? Does "PASS" mean that it is above that level? Or is it vice versa? And what does "." mean? Which ones should I keep as "good" SNPs?
I'm also having some difficulty fully understanding how the VQSLOD is used.... and what does the "culprit" mean when the filter is "PASS"?
A final question.... I've been using this command to actually create a file with only SNPs that PASSed the filter:
java -Xmx2g -jar /share/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T SelectVariants -R ~/broad_bundle/ucsc.hg19.fasta --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf -select 'vc.isNotFiltered()'
Is this the correct way to get PASSed SNPs? Is there a better way? Any help you can give me would be highly appreciated. Thanks!
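For comparison, SelectVariants also has a dedicated flag that drops any record whose FILTER field is set; this sketch (reusing the file names from the command above) should be equivalent to the JEXL expression:

```shell
# --excludeFiltered keeps only records whose FILTER is PASS or "."
java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants \
  -R ucsc.hg19.fasta \
  --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf \
  -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf \
  --excludeFiltered
```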
I have a question about SNP calling from GATK.
We want to call SNPs from many individuals, and we hope to obtain confident genotype calls. But when I check the SNP results, I find some problems, such as in the following example:
scaffold_1 12816 . C T 5075.21 TruthSensitivityTranche99.90to100.00 AC=20;AF=0.417;AN=48;BaseQRankSum=14.475;DB;DP=2186;Dels=0.00;FS=1.453;HRun=1;HaplotypeScore=1.2508;InbreedingCoeff=0.8258;MQ=23.69;MQ0=948;MQRankSum=-11.287;QD=10.78;ReadPosRankSum=13.705;SB=-276.66;VQSLOD=-17.8239;culprit=ReadPosRankSum
0/0:204,0:204:99:0,366,3726 **1/1:19,20:39:27.03:249,27,0 ** 0/0:110,0:110:99:0,219,2229 0/1:54,20:74:27.15:27,0,871 0/0:98,0:98:99:0,156,1701 0/0:208,0:211:99:0,490,5141 0/0:147,0:148:99:0,279,2915 ./. 1/1:6,22:28:27.04:282,27,0 1/1:0,31:31:39.04:360,39,0 0/0:220,0:220:99:0,424,4397 ./. 1/1:0,30:30:75.07:692,75,0 0/1:25,12:37:99:101,0,637 1/1:0,36:36:69.07:636,69,0 0/0:39,0:39:87.10:0,87,852 1/1:23,61:84:99:1107,120,0 1/1:0,38:38:78.08:736,78,0 ./. ./. 0/0:131,0:131:99:0,213,2230 1/1:6,37:43:57.08:593,57,0 1/1:12,19:31:30.04:293,30,0 0/0:27,8:35:54.08:0,54,537 ./. 0/0:64,0:65:96.13:0,96,993 0/0:116,1:117:99:0,207,2124 ./. 0/0:5,0:5:15.03:0,15,162 0/0:95,0:95:99:0,192,2012
I have 30 individuals for which I need to call SNPs. I find that the GT calls differ based on read depth. For one individual: 1/1:19,20:39:27.03:249,27,0. The two alleles are covered by 19 and 20 reads, so the genotype should be 0/1, a heterozygote, but GATK calls it homozygous. What is the problem?
There are lots of cases like this in my data. How can I check whether GATK's SNP calls are right or wrong? I hope you can give me some help.
I am running the quality score recalibration and got a version error like this:
`java -Xmx4g -jar ~/GenomeAnalysisTK-2.1-3/GenomeAnalysisTK.jar -l INFO -R hg19.fa --DBSNP snp137.txt -I sdD1a_bam.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile sdD1a.recal_data.csv`
`##### ERROR ------------------------------------------------------------------------------------------`
Is there a solution for that? Thanks.
I have the genomes of several isolates of a parasite, and I would like to investigate synonymous/non-synonymous substitutions to identify potential antigens, as well as SNPs genome-wide, and I am wondering how well BWA/GATK are suited to this purpose. I've been told that BWA is only really good with sequences <2% divergent, and some of the antigens in this species are known to be >20% divergent. However, I also know that GATK does local realignment around indels. So I would specifically like to know: are BWA/GATK good for looking at substitutions/SNPs in highly variable genes, and if not, which other alignment tools are compatible and appropriate for this purpose?
I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice the number of reads as the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth (DP) for the Illumina-Nextera is roughly twice that of the NEB-Next, but the quality score (QUAL) remains similar. I was expecting the QUAL to reflect the read depth, but obviously this is not the case. Could you enlighten me?
This is two separate questions:
Starting with a VCF file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high and very low? e.g. 5% on either side?
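One way to pick such cutoffs empirically is to read them off percentiles of the observed DP distribution. A small shell sketch (the tiny VCF below is a made-up toy example; point the extraction at your real file, and note that 2.5% per tail is a judgment call, not a GATK recommendation):

```shell
# Toy VCF standing in for a real callset -- the DP values are invented.
cat > tiny.vcf <<'EOF'
##fileformat=VCFv4.1
chr2L	100	.	A	T	50	PASS	AC=2;DP=5
chr2L	200	.	G	C	60	PASS	AC=2;DP=20
chr2L	300	.	C	A	70	PASS	AC=2;DP=25
chr2L	400	.	T	G	80	PASS	AC=2;DP=30
chr2L	500	.	A	C	90	PASS	AC=2;DP=400
EOF

# Extract the INFO DP value per site, sort numerically, and read off the
# 2.5th and 97.5th percentiles as candidate low/high depth cutoffs.
grep -v '^#' tiny.vcf \
  | sed -n 's/.*DP=\([0-9]*\).*/\1/p' \
  | sort -n \
  | awk '{a[NR]=$1}
         END {print "low:",  a[int(NR*0.025)+1];
              print "high:", a[int(NR*0.975)+1]}'
```

The resulting low/high values could then go into a VariantFiltration filter expression such as `DP < low || DP > high`.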
I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do this earlier in the GATK process. I've tried -L 2L -L 2R -L 3L ... etc., -L 2L 2R 3L ... etc., and -L 2L, 2R, 3R ... etc., but the result is either an input error message or chromosome 2L only.
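On the interval syntax: repeating the -L argument once per contig is the supported form, as is passing a file of intervals. A sketch (tool choice and file names here are placeholders):

```shell
# Either repeat -L once per contig...
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
  -R dmel.fasta -I sample.bam -o out.vcf \
  -L 2L -L 2R -L 3L -L 3R -L X

# ...or put one interval per line in a file and pass it once:
#   printf '2L\n2R\n3L\n3R\nX\n' > chroms.intervals
#   ... -L chroms.intervals
```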
Many thanks and apologies if I've missed anything in the instructions.
Hi :), I'm new to NGS and have a question regarding filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNP calling and hard filtering, the latter because there are no known SNPs so far. Now I am not sure how to set the minimal SNP quality (QUAL). The best practices suggest using at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?
many thanks in advance
Working with a mapping that has very high coverage in certain places, I have noticed that the Unified Genotyper often calls SNPs even if the alternate allele is present in only a minuscule (<1%) fraction of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for them, and QD, which is low in these situations, is also low for many other (good) SNPs.
There already is a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs – and if not, what is your reasoning for omitting this option and what steps do you recommend for filtering out these SNPs?
I have bisulfite-treated sequence data mapped using Bismark and Bowtie2, and I'd like to call SNPs and indels from it. I have used Bis-SNP to call SNPs, but it doesn't call indels. Can I use GATK to call indels from the mapped data? Do you have any support for bisulfite data? Another question, please: the data is a mix from 6 different people; do you have any support for pooled data? Thanks for your help.
I was wondering, in the vcf output from UnifiedGenotyper, what metrics go into the variant quality score, QUAL?
I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.
Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.
Basically I have an odd-looking distribution of my variant quality scores (see attached png), and was wondering how concerned should I be and how can I rectify it.
The input data from the graph is from UnifiedGenotyper vcf output file, QUAL values. The four samples in the vcf file are one Drosophila reference line, and three more which are outcrosses of the reference line and thus are heterozygous for the reference allele.
My fastq read-mapping pipeline includes adapter and low-quality base removal, and local re-alignment. I've also attached a pdf showing read quality distribution from one of the samples which also looks a bit odd.
The Unified Genotyper calls SNPs relative to the specified, publicly-available reference assembly.
How can I call SNPs (in many samples, with UG) relative to an in-house individual, which I have sequenced at high-coverage?
My current solution is to perform a de novo assembly on the in-house reference individual using e.g. Velvet, and then simply use the fasta as the reference for UG.
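If you go that route, note that GATK expects a custom fasta to be prepared like any other reference before UG will accept it (a sketch; the file names are placeholders):

```shell
# Index the fasta and build the sequence dictionary GATK requires
# alongside any reference; paths are illustrative only.
samtools faidx inhouse_ref.fasta
java -jar CreateSequenceDictionary.jar R=inhouse_ref.fasta O=inhouse_ref.dict
```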
Can the publicly-available reference assembly still be useful here for speeding up the mapping and filling in missing parts?
My organism is Drosophila melanogaster.
Hi to all, I ran UnifiedGenotyper on three exome-seq samples and phased with the familial pedigree. During manual filtration I saw several inconsistencies. For example, I get this output from UnifiedGenotyper and phasing:
chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169
The order of the samples is father, mother and son. How is it possible that the father, with 30 REF bases and 22 VAR bases respectively, is called 0/0? Thanks
I have used UnifiedGenotyper to call SNPs. I found some SNPs that have been reported from low-quality reads in chromosome X and chromosome Y. Is it possible not to take low-quality reads into account while calling SNPs using UnifiedGenotyper? Or do I need to do quality filtering of the BAM files beforehand?
I have now run into the situation a couple of times that I need to extract a set of private SNPs from a multi-sample VCF file, for example in a forward-genetics knockout screen of a large set of samples.
It is possible with vcf-contrast from vcf-tools:
vcf-contrast +sample1 -sample2 -sample3 -n input.vcf > private_sample1.vcf
vcf-contrast -sample1 +sample2 -sample3 -n input.vcf > private_sample2.vcf
vcf-contrast -sample1 -sample2 +sample3 -n input.vcf > private_sample3.vcf
After this I would still have to filter out the private 0/0 calls, and doing this for a large multi-sample VCF means entering this command for every combination, which is not really nice.
Surely this must be possible with GATK. Does anyone know how to do this with GATK?
Maybe it is somewhere in SelectVariants? The --discordance option looked promising, but there is something about the samples having to be the same? Or is it possible to write another variant walker or a JEXL expression?
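One possible route via SelectVariants' JEXL support (a sketch, not a confirmed recipe; sample and file names are placeholders, and the command would have to be repeated once per sample):

```shell
# Keep sites where sample1 carries a non-reference genotype while the
# other samples are hom-ref; all names here are placeholders.
java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants \
  -R ref.fasta --variant input.vcf -o private_sample1.vcf \
  -select '!vc.getGenotype("sample1").isHomRef() && vc.getGenotype("sample2").isHomRef() && vc.getGenotype("sample3").isHomRef()'
```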
P.S. By accident I also posted this question in the XHMM forum; an admin could remove it there.
Hi All, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result, and I am following this protocol:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP); indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL); recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP); analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL);
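For reference, that protocol corresponds roughly to the following tool invocations (a sketch: the resource and annotation arguments that VariantRecalibrator requires are omitted, and file names are placeholders):

```shell
# SNP pass (resource/-an arguments omitted for brevity)
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input raw.vcf \
  -mode SNP -recalFile snp.recal -tranchesFile snp.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input raw.vcf \
  -mode SNP -recalFile snp.recal -tranchesFile snp.tranches -o recalSNPs.rawIndels.vcf
# ...then repeat both steps with -mode INDEL on recalSNPs.rawIndels.vcf.
```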
I want to know: would it be better to separate the SNPs and indels before running VQSR, like this?
SNP.raw.vcf, INDEL.raw.vcf <- SeparateSNPINDEL(raw.vcf); snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP); indel.model <- BuildErrorModelWithVQSR(INDEL.raw.vcf, INDEL); SNP_analysisReady.vcf <- ApplyRecalibration(SNP.raw.vcf, snp.model, SNP); INDEL_analysisReady.vcf <- ApplyRecalibration(INDEL.raw.vcf, indel.model, INDEL);
Thanks a lot !
We are attempting to see if using ReducedReads will help with the overwhelming file sizes for the SNP calling we are doing on whole-genome BAM files. We have been using a protocol similar to the one described in the best practices document ("Best: multi-sample realignment with known sites and recalibration"). My question is: what is the best point in the pipeline to use ReducedReads?
I am implementing a tool that uses reads to identify potential sites for cancer, but I need a reliable genotype call from non-cancerous tissue. Currently, I'm trying to use GATK to make those calls, filter out the bad, and then use the remaining calls when I look over the raw reads in the cancer tissue.
The problem I'm finding is that GATK doesn't provide a GQ for variants that are homozygous for the reference, and I don't know how to correlate the QUAL with the GQs from those homozygous non-ref. Is there a way that I'm overlooking to have GATK provide the GQ for homozygous reference non-variant sites?
Leishmania has 36 chromosomes but their copy number is unpredictable for each strain and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we use just diploid setting. Some samples have consistently more heterozygous SNPs in higher copy chromosomes but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and abundance of heterozygous SNPs.