I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.
Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.
#CHROM POS ID REF ALT 100403001-1 100403001-1A 100403001-1B 1 5933530 rs905469 A G 0/0:37,0:0,99,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 5933530 rs905469 A G 0/0:37,0:189,15,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 10412636 rs4846215 A T 0/0:119,0:0,358,4297 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 10412636 rs4846215 A T 0/0:119,0:110,9,0 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 11729035 rs79974326 G C 0/0:50,0:0,141,1709 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 11729035 rs79974326 G C 0/0:50,0:1930,0,3851 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 16735764 rs182873855 G A 0/0:54,0:0,138,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 16735764 rs182873855 G A 0/0:54,0:174,0,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 17316577 rs77880760 G T 0/0:42,0:0,123,1470 0/0:38,0:0,111,1317 0/0:53,0:0,153,1817 1 17316577 rs77880760 G T 0/0:42,0:233,17,1470 0/0:38,0:0,111,1317 0/0:225,25:0,153,181 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:0,87,1066 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:1844,159,0 1 31740706 rs3753373 A G 0/0:123,0:0,349,4173 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885 1 31740706 rs3753373 A G 0/0:123,0:117,6,0 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885
Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.
In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.
I am trying to incorporate genotype likelihoods into a downstream analysis. I have two questions:
1) Why is the most likely genotype scaled to a Phred score of zero?
2) Is there a way to undo the scaling? I have seen downstream tools undo the scaling, but I don't know how they do it. Is there an equation that will return an estimated genotype likelihood from the scaled genotype likelihoods?
Thank you for your time.
Dear GATK team and community members,
I used ProduceBeagleInput to create a genotype likelihoods file, and ran beagle.jar according to the example in http://gatkforums.broadinstitute.org/discussion/43/interface-with-beagle-software. Beagle gave a warning that it is better to use a reference panel for imputing genotypes and phasing. So I downloaded the recommended reference panel (http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/), but Beagle requires that the alleles be in the same order on both reference and sample files. The tool to do this is check_strands.py (http://faculty.washington.edu/sguy/beagle/strand_switching/README), but it requires both sample and reference files be in .bgl format. This is a little disappointing since not being able to use the reference panel means Beagle's calculations won't be as accurate, although I'm not sure by how much.
I understand that this might be out of the scope of responsibility for the GATK team, but I will greatly appreciate if someone can provide suggestions to allow GATK's input to Beagle be phased using a reference panel. Or hopefully, the GATK team will write a tool to produce .bgl files?
The printed values missed the PL value, for examples, the format is:
GT:AD:DP:GQ:PL ['0/2', '1,0,10', '11', '8.12'] ['0/2', '211,39,0', '250', '99'] ['0/1', '10,1', '11', '14.38'] ['0/1', '4,2', '4', '24.38'] ['0/0', '27,0', '27', '78.26'] ['1/1', '164,2', '183', '99'] ['0/1', '242,1', '249', '99'] ['0/1', '225,0', '233', '99'] ['0/0', '84,5', '82', '81.18']
For every case, the PL value is missing. It happens most often when there are more than one alternative alleles.