I would expect the trio phasing to try and match whole reads from the parents into the child to infer whether a variant came from mum or dad (so to work on the BAM file or at least the output of HaplotypeCaller). But it seem that the input is a VCF file?
I've been exploring de novo mutation identification in the context of a pedigree of trios. I've run the UnfiedGenotyper (UG) given all the bam files for ~25 sets of trios and it appears to identify a set of de novo mutations. When I run the HaplotypeCaller (HC) pipeline, first generating gVCF files for each individual, and then using the merged gVCF files along with the pedigree for genotype refinement and de novo mutation calling, it also finds a number of de novo mutations annotated as hi-confidence de novo mutations. When I compare the UG de novo mutations to the high confidence HC list, there's very little overlap. Many of the UG hi-confidence de novo variants are called by HC, but listed as low-confidence de novo variants, and from looking at a few examples, it would appear that the HC calls have assigned lower genotype confidence levels for the parental (non-mutated, reference) genotypes. Could it be that because the gVCF files aren't storing position-specific information for the reference (non-mutated) positions in the genome, the pedigree-type de novo mutation calling is not as accurate as it could be? Should I be generating gVCFs that include position-specific information?
Many thanks for any insights. If it would help, I could post some examples.
I have followed the GATK guidelines for trio calling, recalculating posteriors, and annotating possible de novo. However, I am confused about some of the 'hiConfDeNovo' calls.
Here is an example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT child father mother chr1 144674391 rs61810930 C T 1177.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=2.82;ClippingRankSum=0.306;DB;DP=71;FS=2.137;MLEAC=3;MLEAF=0.500;MQ=64.80;MQRankSum=-1.675e+00;PG=0,0,0;QD=16.59;ReadPosRankSum=-1.813e+00;SOR=1.034;hiConfDeNovo=child GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/1:16,17:33:99:127:127:0|1:144674371_T_C:645,0,612:645,0,612 0/1:11,5:16:99:127:127:0|1:144674371_T_C:151,0,447:151,0,447 0/1:11,11:22:99:127:127:0|1:144674371_T_C:412,0,692:412,0,692 chr1 145037984 rs2489136 T G 1401.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=-4.595e+00;ClippingRankSum=0.717;DB;DP=107;FS=43.850;MLEAC=3;MLEAF=0.500;MQ=69.55;MQRankSum=-9.500e-02;PG=0,0,0;QD=13.35;ReadPosRankSum=0.032;SOR=0.296;hiConfDeNovo=child GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/1:20,21:41:99:127:127:.:.:510,0,640:510,0,640 0/1:10,18:28:99:127:127:0|1:145037818_T_C:454,0,293:454,0,293 0/1:17,19:36:99:127:127:.:.:468,0,504:468,0,504 chr1 145053717 rs6670785 T C 74.13 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=1.91;ClippingRankSum=-1.220e-01;DB;DP=114;FS=5.902;MLEAC=1;MLEAF=0.167;MQ=68.67;MQRankSum=0.203;PG=0,0,0;QD=3.22;ReadPosRankSum=1.01;SOR=2.712;hiConfDeNovo=child GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/0:48,0:48:99:96:96:.:.:0,102,1530:0,102,1590 0/1:19,4:23:99:96:96:0|1:145053711_C_T:105,0,826:105,0,886 0/0:42,0:42:99:96:96:.:.:0,99,1485:0,99,1545 chr1 145121901 rs10752808 A G 3364.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=4.39;ClippingRankSum=0.353;DB;DP=173;FS=3.597;MLEAC=3;MLEAF=0.500;MQ=63.47;MQRankSum=-3.750e-01;PG=0,0,0;QD=19.45;ReadPosRankSum=0.164;SOR=0.550;hiConfDeNovo=child GT:AD:DP:GQ:JL:JP:PL:PP 0/1:20,37:57:99:127:127:1187,0,627:1187,0,627 0/1:21,24:45:99:127:127:717,0,736:717,0,736 0/1:24,47:71:99:127:127:1491,0,706:1491,0,706 chr2 91766984 rs55874359 A T 6057.92 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=0.467;ClippingRankSum=-5.510e-01;DB;DP=404;FS=23.592;MLEAC=3;MLEAF=0.500;MQ=67.76;MQRankSum=-5.800e-02;PG=0,0,0;QD=15.78;ReadPosRankSum=0.318;SOR=0.878;hiConfDeNovo=child GT:AD:DP:GQ:JL:JP:PGT:PID:PL:PP 0/1:69,62:131:99:127:127:.:.:1543,0,2330:1543,0,2330 0/1:44,40:84:99:127:127:0|1:91766961_T_C:1493,0,1698:1493,0,1698 0/1:87,82:169:99:127:127:0|1:91766961_T_C:3052,0,3450:3052,0,3450
If the all samples are called as
0/1 why is it called a deNovo event? Perhaps I am misunderstanding the definition used? This was done using gatk v3.4.
(EDIT: solution found and explained below, mostly an error on my end, sorry)
I have what I know is a de novo variant (validated) and GATK PhaseByTransmission refuses to see it. Here is what I am starting with in my VCF file: 7 151092903 . G A 338.83 PASS . GT:AD:DP:GQ:PL 0/0:12,0:12:36:0,36,414 0/0:20,0:20:60:0,60,669 0/1:6,15:20:99:389,0,108
When I run java -Xmx2g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -R fasta/human_g1k_v37.fasta -T PhaseByTransmission --DeNovoPrior 0.00001 -V trio1_1553_1554_1555_small.recode.vcf -ped trio1_1553_1554_1555.ped -o trio1_1553_1554_1555.vcf --MendelianViolationsFile trio1_1553_1554_1555_noMendel.tab
I get the following output VCF line: 7 151092903 . G A 338.83 PASS . GT:AD:DP:GQ:PL:TP 1|0:12,0:12:0:0,36,414:13 0|0:20,0:20:60:0,60,669:13 1|0:6,15:20:99:389,0,108:13
So the father is eventually called a het.This happens even when I set the prior to a low value of 10^-5. That does not seem like the right behavior to me, a more appropriate call would be to call both parents ref homs. The genotype likelihood certainly suggest that for a 10^-5 prior of de novo event, this would make sense.
EDIT: OK, I wish I could remove this post. I don't think I can but I can edit the answer at least. I was just misreading the genotype likelihood. The evidence in favour of a homozygous call in the father is in fact weaker than I thought. A prior of de novo calls of 5x10^-4 fixes things, and with that threshold I am getting a proper de novo call at this location. I apologize for the pointless post!
I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
Is it possible to use PhaseByTransmission with families that are larger than a single trio? I have a family with four siblings. If I include all of the siblings in the PED I get:
PhaseByTransmission - Caution: Family BMD has 6 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped. ERROR MESSAGE: Bad input: No PED file passed or no trios found in PED file. Aborted.
And if I just include the one key trio with the proband, I get the following:
ERROR MESSAGE: Sample BMD006_R found in data sources but not in pedigree files with STRICT pedigree validation
There does not seem to be an accessible argument for relaxing the pedigree validation. Is there a way to use PhaseByTransmission with my larger family?