We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.
We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:
scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.
In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.
A reverse example:
scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.
Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.
I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.
I am adapting a two-year old pipeline, which includes UnifiedGenotyper to call variants. I would like to update this to the HaplotypeCaller. In the command for the UnifiedGenotyper there is an option --metrics_file. I cannot find reference to this in the GATK documentation - do you know what it is and whether I can use it with HaplotypeCaller as well?
Also, do I understand correctly that using HaplotypeCaller to call variants means you don't have to carry out the local realignment around indels step prior to this?
Thanks very much,
Looking closely at our vcf file produced by HaplotypeCaller we noticed disagreement between PL values and GT in some variants. For example.
1 762589 rs71507461 G C 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:0,33:33:99:1464,99,0 **0/1**:32,38:70:99:**0,9,2274** **0/1**:78,42:120:99:**323,30,0** **1/1**:11,5:96:99:**123,0,324** 1/1:2,84:86:99:3923,271,0 1/1:2,104:106:99:4945,348,0 1 762592 rs71507462 C G 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:0,60:32:99:1464,99,0 0/1:586,0:67:99:1471,0,2274 0/1:77,42:119:99:1667,0,6152 1/1:1,95:96:99:4462,310,0 1/1:2,85:87:99:3923,271,0 1/1:2,101:103:99:4945,348,0 1 762601 rs71507463 T C 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:1675,144:32:99:1464,99,0 0/1:30,38:68:99:1471,0,2274 0/1:79,42:121:99:1667,0,495 1/1:20,24:90:99:4462,0,476 1/1:2,83:85:99:3923,271,0 1/1:3,107:110:99:4945,348,0
What is the relationship between GT and PL - I initially thought PL determined GT? What other factors go into GATK deciding on a particular GT? For example, does GATK take into account the GT of nearby variants to determine the GT of an individual variant?
Thank you in advance for your help.
I'm starting to use use the HaplotypeCaller to identify variants in my exome projects, but I was wondering how it initially determines if a region has the potential to be variable. I couldn't find any useful documentation, online...so could anyone of you help me to basically understand this first step?
Thank you in advance.
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
I'm trying to call variants from bowtie-aligned reads, I used PrintReads with ReassignMappingQuality filter to give all reads a mapping score of 60 to replace default value of 255. However, I'm wondering if this assignment would introduce any bias in variant calling.
Thanks a lot!
Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:
chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169
The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks
I just had a look at the code of the UnifiedGenotyper and how its Variant Calling algorithm is implemented (very well documented by the way :) . But now I wonder how GATK reads in the SAM file and finds out where the differences to the reference (SNPs, Indels) are that are then examined in the Variant Calling. I only see that the UnifiedGenotyper gets a set of alleles, but not where the alleles are actually determined. I have also found out that GATK is using Samtools for parsing SAM files, but have not found the point where the actual reads are parsed and processed (e.g., by using the CIGAR string). Are you maybe doing a local realignment before the actual Variant Calling from which you get the alleles?
I would appreciate if you could guide me to the right direction where this happens in the code.
we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.
details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:
@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1
@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1
@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1
@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1
@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1
@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1
Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?
Does that make sense? Any advice?
i have sequenced two exomes from two affected individuals from the same family. i have called their variants with the recommended UnifiedGenotyper protocol for each sample individually (was done a year and half ago..), and then did theintersection to find teh shared ones. i was wondering, if you recommend calling them at one step, togther? is there any added value for doing so (in term of reducing false positives, accuracy etc..).
I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?
AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum
Hi, I’m trying to use GATK for variant calling. I have had some problems preparing the hg19 reference genome. I haven’t done the alignment myself, but gotten the .bam files already done. I how ever know that this should be the same reference used for the alignment. I have downloaded the hg19 from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and downloaded the new version of mitochondrial genome (NC_012920.1). and then tried: cat chr*.fa > hg19.fa
After this I have followed the instructions on (how to) Prepare a reference for use with BWA and GATK.
When I try to call the variants using HaplotypeCaller (as instructed on: (howto) Call variants on a diploid genome with the HaplotypeCaller):
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I 8526RU.rmdup.bam -L 20 --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants8526RU.vcf
I get the error message: “Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?” Can you tell me what the problem is? And how to fix this? I know there have been some similar questions considering the contigs, but I haven’t been able to solve the problem based on them.
Thank you, Sini