I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.
Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?
Thanks very much in advance,
I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:
16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.
whereas the initial set of sites in my final vcf file is:
16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.
First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.
I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).
FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)
Thank you in advance, Alva
If I want the variants to be called only if they fit the following criteria:
1) Min. total coverage for consideration of heterozygous is 10.
2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.
3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.
4) Min. coverage in order for a position to be called homozygous is 6.
which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.
I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).
Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.
I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.
I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.
Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.
I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.
I'm having trouble running the haplotypecaller on a few of my samples. Before I go full scale on all of my samples I am attempting the pipeline on a small number of my samples. When I try running haplotypecaller on my samples individually with the command below I get a stack trace error (also below). Usually I can figure out the stack trace errors and correct them, but I am having trouble with this one. With this error, the program runs for a time then errors out (sometime it runs longer then other times). It doesn't occur for every sample and when it does occur, if I re-run that same individual and same command again it sometimes goes to completion without errors. Other times I might have to run that individual/command multiple times before I can get it to complete without errors. Any insight would be greatly appreciated.
Command: java -Xmx16g -jar ~/programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../miliaris_ref.fa -I realigned/27861_realigned.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --max_alternate_alleles 2 -mbq 30 -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 4.0 -stand_emit_conf 3.0 -o rawSNPS_Q4/27861_rawSNPS_Q4.vcf -nct 8 -A HaplotypeScore -A FisherStrand -A BaseQualityRankSumTest -A MappingQualityRankSumTest -A ReadPosRankSumTest -A QualByDepth -A VariantType -A LowMQ
Error: . . . INFO 12:36:15,578 ProgressMeter - comp45464_c0_seq1:533 2.82e+06 2.5 m 53.0 s 7.1% 35.2 m 32.7 m INFO 12:36:45,579 ProgressMeter - comp45609_c0_seq1:269 2.84e+06 3.0 m 63.0 s 7.2% 41.9 m 38.9 m INFO 12:37:15,580 ProgressMeter - comp48164_c0_seq1:493 3.09e+06 3.5 m 67.0 s 7.8% 44.9 m 41.4 m WARN 12:37:24,038 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at comp48440_c0_seq1:357 has 6 alter$ INFO 12:37:45,581 ProgressMeter - comp53167_c0_seq2:450 3.67e+06 4.0 m 65.0 s 9.2% 43.3 m 39.3 m
HI, I'd like to report a weird result from HaplotypeCaller.
We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.
However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)
Please see the VCF table:
There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.
Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.
Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.
Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample?
Thanks in advance, Johannes
Dear GATK Team,
I have encountered a problem while doing a variant calling. I received raw reads from sequencing company and wanted to do some analysis myself. I did everything according to GATK best practices and everything was fine until I looked at VCF file for called variants and I did not find the variant which was reported to me from sequencing company as a disease causing mutation and therefore was the only one I especially searched for. It is a 1bp heterozygous deletion, which can be seen in IGV picture attached to this post. I have tried to change parameters for both HaplotypeCaller and UnifiedGenotyper, but nothing has made them to call this indel.
The last command I used was:
java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6:157097064-157533913 -o ARID1B/ARID1B_UG_emit1_call20_INDEL.vcf -glm INDEL
I also have tried lowering the -minIndelCnt, -minIndelFrac arguments, and maximized the --max_deletion_fraction while using UG. But nothing helped.
Do you have any ideas what could be the problem?
Thank you in advance!
We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.
We seem to have too few indels to do VQSR on them and have a couple of questions:
Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?
If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?
The other question is whether joint or batch variant calling across several samples would help us in this case?
Thanks in advance!
I have annotated my vcf file of 20 samples from Unified genotyper using the following steps.
My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples?
Hi, I encounter a strange error when using UnifiedGenotyper, The command that I use is: /usr/bin/java -Xmx4g -jar /home/shengyu_ni/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /mnt/genotyping/sendru/human_g1k_v37.fasta -I bqsr.bam -o haplogroup.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -alleles /mnt/genotyping/sendru/isogg.vcf -nt 4 -ploidy 1 -L /mnt/genotyping/sendru/comprey.interval_list
and the error mesage is:
java.lang.ArrayIndexOutOfBoundsException: -1 at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:380) at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:294) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following,
additionally, this vcf file works fine with realignment
Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG
We are running GATK HaplotypeCaller on ~50 whole exome samples. We are interested in rare variants - so we ran GATK in single sample mode instead of multi sample as you recommend, however we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single sample?
Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?
Many thanks in advance.
I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.
However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.
I think I recall another post having a similar issue due to multithreaded use of select variants
APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!
Looking closely at our vcf file produced by HaplotypeCaller we noticed disagreement between PL values and GT in some variants. For example.
1 762589 rs71507461 G C 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:0,33:33:99:1464,99,0 **0/1**:32,38:70:99:**0,9,2274** **0/1**:78,42:120:99:**323,30,0** **1/1**:11,5:96:99:**123,0,324** 1/1:2,84:86:99:3923,271,0 1/1:2,104:106:99:4945,348,0 1 762592 rs71507462 C G 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:0,60:32:99:1464,99,0 0/1:586,0:67:99:1471,0,2274 0/1:77,42:119:99:1667,0,6152 1/1:1,95:96:99:4462,310,0 1/1:2,85:87:99:3923,271,0 1/1:2,101:103:99:4945,348,0 1 762601 rs71507463 T C 21714.69 PASS [clipped] GT:AD:DP:GQ:PL 1/1:1675,144:32:99:1464,99,0 0/1:30,38:68:99:1471,0,2274 0/1:79,42:121:99:1667,0,495 1/1:20,24:90:99:4462,0,476 1/1:2,83:85:99:3923,271,0 1/1:3,107:110:99:4945,348,0
What is the relationship between GT and PL - I initially thought PL determined GT? What other factors go into GATK deciding on a particular GT? For example, does GATK take into account the GT of nearby variants to determine the GT of an individual variant?
Thank you in advance for your help.
I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?
AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum
i have sequenced two exomes from two affected individuals from the same family. i have called their variants with the recommended UnifiedGenotyper protocol for each sample individually (was done a year and half ago..), and then did theintersection to find teh shared ones. i was wondering, if you recommend calling them at one step, togther? is there any added value for doing so (in term of reducing false positives, accuracy etc..).
Hi, I’m trying to use GATK for variant calling. I have had some problems preparing the hg19 reference genome. I haven’t done the alignment myself, but gotten the .bam files already done. I how ever know that this should be the same reference used for the alignment. I have downloaded the hg19 from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and downloaded the new version of mitochondrial genome (NC_012920.1). and then tried: cat chr*.fa > hg19.fa
After this I have followed the instructions on (how to) Prepare a reference for use with BWA and GATK.
When I try to call the variants using HaplotypeCaller (as instructed on: (howto) Call variants on a diploid genome with the HaplotypeCaller):
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I 8526RU.rmdup.bam -L 20 --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants8526RU.vcf
I get the error message: “Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?” Can you tell me what the problem is? And how to fix this? I know there have been some similar questions considering the contigs, but I haven’t been able to solve the problem based on them.
Thank you, Sini
I'm starting to use use the HaplotypeCaller to identify variants in my exome projects, but I was wondering how it initially determines if a region has the potential to be variable. I couldn't find any useful documentation, online...so could anyone of you help me to basically understand this first step?
Thank you in advance.
We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.
We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:
scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.
In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.
A reverse example:
scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.
Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.
I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.
we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.
details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:
@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1
@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1
@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1
@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1
@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1
@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1
Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?
Does that make sense? Any advice?
I'm trying to call variants from bowtie-aligned reads, I used PrintReads with ReassignMappingQuality filter to give all reads a mapping score of 60 to replace default value of 255. However, I'm wondering if this assignment would introduce any bias in variant calling.
Thanks a lot!
I am adapting a two-year old pipeline, which includes UnifiedGenotyper to call variants. I would like to update this to the HaplotypeCaller. In the command for the UnifiedGenotyper there is an option --metrics_file. I cannot find reference to this in the GATK documentation - do you know what it is and whether I can use it with HaplotypeCaller as well?
Also, do I understand correctly that using HaplotypeCaller to call variants means you don't have to carry out the local realignment around indels step prior to this?
Thanks very much,
I just had a look at the code of the UnifiedGenotyper and how its Variant Calling algorithm is implemented (very well documented by the way :) . But now I wonder how GATK reads in the SAM file and finds out where the differences to the reference (SNPs, Indels) are that are then examined in the Variant Calling. I only see that the UnifiedGenotyper gets a set of alleles, but not where the alleles are actually determined. I have also found out that GATK is using Samtools for parsing SAM files, but have not found the point where the actual reads are parsed and processed (e.g., by using the CIGAR string). Are you maybe doing a local realignment before the actual Variant Calling from which you get the alleles?
I would appreciate if you could guide me to the right direction where this happens in the code.
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:
chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169
The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks