Dear GATK team,
I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.
Thanks for all the great work you do with the GATK.
Hello GATK Team,
there are currently two walkers for phasing in the GATK PhaseByTransmission and ReadBackedPhasing. Because of their different information source (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files) these can produce different or complementary genotypes. There used to be a walker for this job "MergeAndMatchHaplotypes" but it seems to be discontinued.
What is the current recommendation for Trios? Only use PhaseByTransmission?
Hi developers, Could you share the detailed ReadBackedPhasing algoritm documents with me? As I don't find much more about the algorithm except GATK website, but i have some questions need to read the algorithm carefully and totally understand ReadBackedPhasing.
THanks very much. Yujian
I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!
I was surprised to notice that my phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) less variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT). A quick scan (
SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.
Is this the expected behaviour of both tools or am I doing something terribly wrong?
If yes, is there an option to emit these variants together with the (phased and unphased variants) in the file specified with -o (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution)?
Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT
Hi, I'm calling Haplotypes of a diploid genome with the HaplotypeCaller and using the Read-Backed Phasing tool afterwards. I believe that Indels can't be phased so far. My output files look like the example showed below. In the example below there is an Indel at Position 795. As you can see, the alternative alleles in position 779 and 787 belong to the same haplotype. Do the alternative alleles in position 799 and 805 also belong to that haplotype or do I only have the phasing information for regions between the Indels and have to find another method to get a continuous sequence for each haplotype? Can I also assume that that haplotype which has the "C" in position 779 and "T" in position 787 has a Deletion (of AAA) in position 795?
GSVIVT01012049001 779 . A C 5696.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.613;ClippingRankSum=0.538;DP=272;FS=4.894;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.499;QD=20.94;ReadPosRankSum=0.640 GT:AD:DP:GQ:PL:PQ 0|1:121,147:268:99:5725,0,11316:10524.93
GSVIVT01012049001 787 . A T 5671.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.203;ClippingRankSum=0.297;DP=259;FS=1.517;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.090;QD=21.90;ReadPosRankSum=0.815 GT:AD:DP:GQ:PL:PQ 0|1:117,139:256:99:5700,0,11316:10067.97
GSVIVT01012049001 795 . TAAA T 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.922;ClippingRankSum=-0.641;DP=254;FS=1.014;MLEAC=1;MLEAF=0.500;MQ=59.43;MQ0=0;MQRankSum=0.316;QD=27.70;ReadPosRankSum=1.731 GT:AD:DP:GQ:PL 0/1:114,131:245:99:5024,0,15611
GSVIVT01012049001 796 . A T 3273.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.717;ClippingRankSum=0.329;DP=245;FS=1.005;MLEAC=1;MLEAF=0.500;MQ=59.51;MQ0=0;MQRankSum=0.631;QD=13.36;ReadPosRankSum=-1.317 GT:AD:DP:GQ:PL:PQ 1|0:126,111:237:99:3302,0,11019:3862.78
GSVIVT01012049001 799 . A T 3554.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.422;ClippingRankSum=-0.037;DP=244;FS=0.483;MLEAC=1;MLEAF=0.500;MQ=59.40;MQ0=0;MQRankSum=-0.095;QD=14.57;ReadPosRankSum=0.600 GT:AD:DP:GQ:PL:PQ 0|1:111,118:229:99:3583,0,16191:3906.37
GSVIVT01012049001 805 . G A 4959.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.604;ClippingRankSum=1.035;DP=237;FS=2.951;MLEAC=1;MLEAF=0.500;MQ=59.31;MQ0=0;MQRankSum=1.228;QD=20.93;ReadPosRankSum=-0.923 GT:AD:DP:GQ:PL:PQ 0|1:99,134:233:99:4988,0,9612:8723.87
Hi all, Just to give some context: I have filtered my trio data with some scripting to only heterozygous (hets) variants that may constitute compound hets (i.e., if phase could be accurately inferred). This is essentially phasing the child data by transmission - for all the het variants seen in the child I looked at the father and mother vcfs and filtered relevant sites as follows: - each het variant in child has to be in only and exactly one of the parents, so this excludes 1) hets present in both parents (these cannot be resolved) and 2) hets not present in any parent (not interested on those as I only want to analyse compound hets); - selected genes with at least two of the above vars; - selected genes with at least one het transmitted from the paternal side and one het from the maternal side.
My question is: can I use this filtered child vcf as my input for ReadBackedPhasing? For each of my genes that feature in the child vcf after the above filtering, I want to determine whether the variants seen within the gene are in the same haplotype or not. I am just not sure if I can do the phasing at this stage - is this alright? If I had to do the phasing early on with the raw vcf, I am not sure how would I maintain the correct phasing information when applying this filtering downstream to the phased vcf (i.e., as the phasing of a het variant is relevant to the previous PASS-ing het variant in the vcf?).
Help would be appreciated! Thanks a lot, Eva
I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?