# Tagged with #variant calling 0 documentation articles | 0 announcements | 30 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Hi,

If I get it right, anomalous read pairs refer to those where only one read is mapped. I wonder whether GATK would ever use these reads to do the calling.

Thank you!

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering.

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.

Thank you

I used GATK 3.3 for calling SNPs for my exome data. I also restricted my SNPs to be called specifying a BED file. When I annotated the vcf file i get lot of SNPs as intronic. What can be its possible reason?

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Hi,

Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.

Thanks!

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

I'm having trouble running the haplotypecaller on a few of my samples. Before I go full scale on all of my samples I am attempting the pipeline on a small number of my samples. When I try running haplotypecaller on my samples individually with the command below I get a stack trace error (also below). Usually I can figure out the stack trace errors and correct them, but I am having trouble with this one. With this error, the program runs for a time then errors out (sometime it runs longer then other times). It doesn't occur for every sample and when it does occur, if I re-run that same individual and same command again it sometimes goes to completion without errors. Other times I might have to run that individual/command multiple times before I can get it to complete without errors. Any insight would be greatly appreciated.

Command: java -Xmx16g -jar ~/programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../miliaris_ref.fa -I realigned/27861_realigned.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --max_alternate_alleles 2 -mbq 30 -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 4.0 -stand_emit_conf 3.0 -o rawSNPS_Q4/27861_rawSNPS_Q4.vcf -nct 8 -A HaplotypeScore -A FisherStrand -A BaseQualityRankSumTest -A MappingQualityRankSumTest -A ReadPosRankSumTest -A QualByDepth -A VariantType -A LowMQ

##### ERROR ------------------------------------------------------------------------------------------

I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following,

additionally, this vcf file works fine with realignment

# CHROM POS ID REF ALT QUAL FILTER INFO

Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG

Thank you.

We are running GATK HaplotypeCaller on ~50 whole exome samples. We are interested in rare variants - so we ran GATK in single sample mode instead of multi sample as you recommend, however we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single sample?

Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?

Hi,

I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.

However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.

I think I recall another post having a similar issue due to multithreaded use of select variants

APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!

Thanks,

MC

Looking closely at our vcf file produced by HaplotypeCaller we noticed disagreement between PL values and GT in some variants. For example.

1   762589  rs71507461  G   C   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:0,33:33:99:1464,99,0    **0/1**:32,38:70:99:**0,9,2274**    **0/1**:78,42:120:99:**323,30,0**   **1/1**:11,5:96:99:**123,0,324**    1/1:2,84:86:99:3923,271,0   1/1:2,104:106:99:4945,348,0
1   762592  rs71507462  C   G   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:0,60:32:99:1464,99,0    0/1:586,0:67:99:1471,0,2274 0/1:77,42:119:99:1667,0,6152    1/1:1,95:96:99:4462,310,0   1/1:2,85:87:99:3923,271,0   1/1:2,101:103:99:4945,348,0
1   762601  rs71507463  T   C   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:1675,144:32:99:1464,99,0    0/1:30,38:68:99:1471,0,2274 0/1:79,42:121:99:1667,0,495 1/1:20,24:90:99:4462,0,476  1/1:2,83:85:99:3923,271,0   1/1:3,107:110:99:4945,348,0


What is the relationship between GT and PL - I initially thought PL determined GT? What other factors go into GATK deciding on a particular GT? For example, does GATK take into account the GT of nearby variants to determine the GT of an individual variant?

Rosalie

I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?

AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum

i have sequenced two exomes from two affected individuals from the same family. i have called their variants with the recommended UnifiedGenotyper protocol for each sample individually (was done a year and half ago..), and then did theintersection to find teh shared ones. i was wondering, if you recommend calling them at one step, togther? is there any added value for doing so (in term of reducing false positives, accuracy etc..).

Hi, I’m trying to use GATK for variant calling. I have had some problems preparing the hg19 reference genome. I haven’t done the alignment myself, but gotten the .bam files already done. I how ever know that this should be the same reference used for the alignment. I have downloaded the hg19 from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and downloaded the new version of mitochondrial genome (NC_012920.1). and then tried: cat chr*.fa > hg19.fa

After this I have followed the instructions on (how to) Prepare a reference for use with BWA and GATK.

When I try to call the variants using HaplotypeCaller (as instructed on: (howto) Call variants on a diploid genome with the HaplotypeCaller):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I 8526RU.rmdup.bam -L 20 --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants8526RU.vcf

I get the error message: “Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?” Can you tell me what the problem is? And how to fix this? I know there have been some similar questions considering the contigs, but I haven’t been able to solve the problem based on them.

Thank you, Sini

Hi there,

I'm starting to use use the HaplotypeCaller to identify variants in my exome projects, but I was wondering how it initially determines if a region has the potential to be variable. I couldn't find any useful documentation, online...so could anyone of you help me to basically understand this first step?

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.

we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?

Does that make sense? Any advice?

Hello!

I'm trying to call variants from bowtie-aligned reads, I used PrintReads with ReassignMappingQuality filter to give all reads a mapping score of 60 to replace default value of 255. However, I'm wondering if this assignment would introduce any bias in variant calling.

Thanks a lot!

Hi,

I am adapting a two-year old pipeline, which includes UnifiedGenotyper to call variants. I would like to update this to the HaplotypeCaller. In the command for the UnifiedGenotyper there is an option --metrics_file. I cannot find reference to this in the GATK documentation - do you know what it is and whether I can use it with HaplotypeCaller as well?

Also, do I understand correctly that using HaplotypeCaller to call variants means you don't have to carry out the local realignment around indels step prior to this?

Thanks very much,

Kath

Hi there,

I just had a look at the code of the UnifiedGenotyper and how its Variant Calling algorithm is implemented (very well documented by the way :) . But now I wonder how GATK reads in the SAM file and finds out where the differences to the reference (SNPs, Indels) are that are then examined in the Variant Calling. I only see that the UnifiedGenotyper gets a set of alleles, but not where the alleles are actually determined. I have also found out that GATK is using Samtools for parsing SAM files, but have not found the point where the actual reads are parsed and processed (e.g., by using the CIGAR string). Are you maybe doing a local realignment before the actual Variant Calling from which you get the alleles?

I would appreciate if you could guide me to the right direction where this happens in the code.

Best regards,

Cindy

Hi,

I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:

1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?

2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?

Best wishes,

Kath

Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:

chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169

The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks

Giuliano