If you are sure that you cannot use VQSR / recalibrate variants (typically because your dataset is too small, or because there are no truth/training resources available for your organism), then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here, here and here based on the recommendations detailed further below.
Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, etc.
HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.
In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.
In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both number of samples or size of targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking on the filtering parameters.
So, here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you:
QD < 2.0
MQ < 40.0
FS > 60.0
HaplotypeScore > 13.0only for variants output byUnifiedGenotyper; for HaplotypeCaller's output it is not informative
MQRankSum < -12.5
ReadPosRankSum < -8.0
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.
HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.
And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?
Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)
We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.
I would appreciate your thoughts on the following pipeline:
I'm currently working on a number of WGS of non-human vertebrates. My approach for calling variants is to maximize the sensitivity of the calls by using two callers (GATK's UnifiedGenotyper + samtools' mpileup) per chromosome regardless of / ingnoring all filters. Next, I would like to merge (not intersect) the two vcf files (GATK+samtools) per each chromosome, then merge (not intersect) all the vcf files pertaining to all chromosomes in order to retrieve a final vcf dataset per individual:
For merging the GATK and samtools:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:GATK chr#.GATK.vcf --variant:samtools chr#.samtools.vcf -o chr#.GATK_samtools.union.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL
For merging all chromosomes per individual:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:chr1 chr1.GATK_samtools.union.vcf --variant:chr2 chr2.GATK_samtools.union.vcf --variant:chr3 chr3.GATK_samtools.union.vcf -o Individual1.union.vcf -genotypeMergeOptions PRIORITIZE -priority chr1,chr2,chr3 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Finally I would like to intersect between two individuals and keep only the variants that are common to both individuals:
Uniting / merging two individuals:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:individual1 Individual1.union.vcf --variant:Individual2 Individual2.union.vcf -o Individual1_2.union.vcf -genotypeMergeOptions PRIORITIZE -priority Indiviual1,Individual2 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Intersecting the two indiviuals in order to keep only common variants:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta --variant Individual1_2.union.vcf -select 'set == "Intersection";' -o Intersected.vcf
Am I doing this right? I'm afraid I may be losing variants or something else along this pipeline. Remember that I want to keep only the common variants while ignoring the filters in order to increase sensitivity as much as possible.
Hi, I am estimating SNP number for a genome of a speceis.
I found that the number of SNP in the fastq going through GATK is 10 times more than the first fastq. Interestingly, if I use picard to do duplicats-removomg again to the GATK bam and used samtools to convert the bam to fastq file. The SNP jumps back to 10 time fewer.
What can be the reason that the SNP number can be 10 time different between the two methods? Actually, I expect the GATK output file has fewer SNP given the effect of recaliraiton or relingement. But the result is opposite.