A new tool has been released!

Check out the documentation at PhaseByTransmission.

There are two types of GATK tools that are able to use pedigree (family structure) information:

Tools that require a pedigree to operate

PhaseByTransmission and CalculateGenotypePosteriors will not run without a properly formatted pedigree file. These tools are part of the Genotype Refinement workflow, which is documented here.

Tools that are able to generate standard variant annotations

The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.

If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g. to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).

About the PED format

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
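As a rough illustration of that requirement, here is a minimal Python sketch that cuts a full PLINK PED file (six pedigree columns followed by allele columns) down to the first six columns: Family ID, Individual ID, Paternal ID, Maternal ID, Sex and Phenotype. The function name and the sample rows are hypothetical and chosen only for the example; this is not part of GATK.

```python
def trim_ped_to_six_columns(lines):
    """Keep only the first six whitespace-separated fields of each PED line."""
    trimmed = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(fields) < 6:
            raise ValueError(
                f"PED line {number}: expected at least 6 fields, got {len(fields)}"
            )
        # Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype
        trimmed.append("\t".join(fields[:6]))
    return trimmed


# Hypothetical PLINK-style rows with allele columns after the sixth field.
plink_ped = [
    "fam1 child dad mom 2 2 C T C T G G",
    "fam1 dad 0 0 1 1 C C C C G G",
    "fam1 mom 0 0 2 1 T T T T G G",
]

for row in trim_ped_to_six_columns(plink_ped):
    print(row)
```

The same result can usually be had with a one-line `cut` or `awk` command; the point is only that everything after the sixth column has to go.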

Hi, is there any way to phase all of the child genotypes using both PhaseByTransmission and ReadBackedPhasing, given the genotype information for a family trio?

-Ruhul

Hi, I have genotyped a trio and then phased it using GATK's PhaseByTransmission. I then parsed the output VCF file using the BCFTools query command. The output file looks as follows:


20 65288 G/T ./. ./.
20 65900 A|A A|G A|A
20 66720 ./. C/A C/A
20 68749 T|C C|C C|T
20 69094 G/A ./. ./.

The phasing information can be interpreted as Mother|Father. Now, I understand that the child's phasing can be done by applying Mendelian laws. But how is the phasing of the father or mother done here? For example, in "20 68749 T|C C|C C|T", every person in the trio is phased, even the father (het), without knowing anything about the child's grandparents. As I am a new researcher in this field, your clarification will help me a lot in understanding phasing.

-Ruhul

Hi all,

Can I use ReadBackedPhasing or some other computational tool to distinguish between paternal and maternal chromosomes/reads based on paired-end DNA sequence data (fastq files)? Thank you.

I had some questions regarding how PhaseByTransmission interprets recombination events. Does the trio-based phasing explicitly model recombination events?

For two linked loci:

Parent 1: A|T G|A with haplotypes AG and TA
Parent 2: T|T A|A with haplotypes TA and TA

Child 1: A|T A|A with recombinant haplotype AA and normal haplotype TA
Child 2: A|T G|A with haplotypes AG and TA

It is obvious from the data for both possible trios that TA is one haplotype since both loci are homozygous in one of the parents. Can PhaseByTransmission consider all the offspring at once to identify recombination from AG to AA in parent 1? In other words does it make any sort of haplotypic inference across all the progeny, or does it consider each SNP individually and simply phase that SNP within each trio? The latter case would produce a conflict in the phasing solution for parent 1:

Parent 1:
A|T A|G for Child 1 (AA must be on one chromosome and TG on the other)
A|T G|A for Child 2 (AG must be on the same chromosome because parent 2 is homozygous at both loci)

Please let me know.

  Stefano

Hello! I am currently focusing on identifying de novo mutations from my trio data (the parents are unaffected and the child is affected). I used PhaseByTransmission and have a list of questions. I am pretty new to the field, so it would be helpful if someone could clear up my doubts. I have provided the PhaseByTransmission log for your perusal. Please note that I used the default '--DeNovoPrior' as well as '--DeNovoPrior 0.00001' (shown below).

My questions are:

  1. Is this a natural output from PhaseByTransmission (please see the result summary provided by PhaseByTransmission)?

  2. In my mendelian_violation.vcf file, all the records are unphased (Only '/'...no '|'). Is that correct? If not, then what are the reasons behind that? Rest of the records were phased properly (showing '|').

  3. Should I consider the file 'mendelian_violation.vcf' to extract all de novo variants? Also, I did not see any denovo variants that were phased. I mean, all the records show genotype info with '/'.

  4. Is it okay if I only use PhaseByTransmission and do not run ReadBackedPhasing, or do I need to run both? I read other comments in the GATK forum saying that users are not required to run these two tools in any particular order. However, I am wondering whether running only PhaseByTransmission is okay.

  5. Can I obtain autosomal recessive and compound heterozygous variants from PhaseByTransmission? Or is it better to take the phased data (obtained from PhaseByTransmission) and use another tool to retrieve those two mutation types?

java -jar /gatk_3.3/GenomeAnalysisTK.jar -R /reference_sequence/human_g1k_v37.fasta -T PhaseByTransmission -V trio1.vcf -ped trio1.ped --DeNovoPrior 0.00001 -o trio_out.vcf --MendelianViolationsFile mendelian_violation.vcf

INFO 20:04:04,201 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:04:04,341 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:04:04,453 PedReader - Reading PED file fam_15-054.ped with missing fields: []
INFO 20:04:04,457 PedReader - Phenotype is other? false
INFO 20:04:04,510 GenomeAnalysisEngine - Preparing for traversal
INFO 20:04:04,530 GenomeAnalysisEngine - Done preparing for traversal
INFO 20:04:04,531 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 20:04:04,531 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 20:04:04,532 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 20:04:34,824 ProgressMeter - 15:96876611 147844.0 30.0 s 3.4 m 77.5% 38.0 s 8.0 s
INFO 20:04:43,701 PhaseByTransmission - Number of complete trio-genotypes: 139299
INFO 20:04:43,702 PhaseByTransmission - Number of trio-genotypes containing no call(s): 0
INFO 20:04:43,703 PhaseByTransmission - Number of trio-genotypes phased: 124651
INFO 20:04:43,703 PhaseByTransmission - Number of resulting Het/Het/Het trios: 13391
INFO 20:04:43,704 PhaseByTransmission - Number of remaining single mendelian violations in trios: 937
INFO 20:04:43,704 PhaseByTransmission - Number of remaining double mendelian violations in trios: 12
INFO 20:04:43,704 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO 20:04:43,705 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO 20:04:43,705 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO 20:04:43,705 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO 20:04:43,706 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO 20:04:43,706 PhaseByTransmission - Number of genotypes updated: 4395
INFO 20:04:45,481 ProgressMeter - done 201351.0 40.0 s 3.4 m 100.0% 40.0 s 0.0 s
INFO 20:04:45,482 ProgressMeter - Total runtime 40.95 secs, 0.68 min, 0.01 hours
INFO 20:04:47,002 GATKRunReport - Uploaded run statistics report to AWS S3

Thanks in advance.

Hello, I have 10 trios (unaffected father, unaffected mother, and affected child) and I want to extract de novo, autosomal recessive, and compound heterozygous variants. I want to use PhaseByTransmission and I have the pedigree information. My question is: can I use PhaseByTransmission for a small number of trios (10 trios)? Could the small sample size be an issue? Can I run PhaseByTransmission on each trio separately?

Thanks in advance. Newbee

I am using PhaseByTransmission to phase variants called from a trio (mother, father and son) contained in a VCF file. It's my first time using this tool. Do I need to follow any convention when naming my samples? Do the sample names in the VCF file need to match the Family ID, the individual ID, or a combination of the two in the PED file? Do they need to be in the same order?

Dear GATK,

I am trying to run PhaseByTransmission on a parent/child pair in a multi-sample vcf.gz (with --pedigreeValidationType SILENT). Although I am not getting any errors, the output is quite strange and indicates that the pedigree has not been taken into account properly. Does anyone know how to pass a parent/child pair to this tool?

Here is some info on my data:

$ cat tmp_1.ped
Fam1 P20__ 0 0 2 0
Fam1 P10__ 0 P20__ 1 0

$ grep CHROM $vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT C1B__ C2B__ C3B__ C4B__ C5B__ C6B__ C7B__ P10__ P14__ P20__ P24__ P4__ P5__ P7__

INFO 10:42:44,130 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 10:42:45,414 PhaseByTransmission - Number of complete trio-genotypes: 0
INFO 10:42:45,415 PhaseByTransmission - Number of trio-genotypes containing no call(s): 0
INFO 10:42:45,415 PhaseByTransmission - Number of trio-genotypes phased: 0
INFO 10:42:45,415 PhaseByTransmission - Number of resulting Het/Het/Het trios: 0
INFO 10:42:45,415 PhaseByTransmission - Number of remaining single mendelian violations in trios: 0
INFO 10:42:45,416 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO 10:42:45,416 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO 10:42:45,416 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO 10:42:45,416 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO 10:42:45,417 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO 10:42:45,417 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO 10:42:45,417 PhaseByTransmission - Number of genotypes updated: 0
INFO 10:42:45,446 ProgressMeter - done 0.0 1.0 s 15.2 d 100.0% 1.0 s 0.0 s
INFO 10:42:45,446 ProgressMeter - Total runtime 1.32 secs, 0.02 min, 0.00 hours

The 2013 "best practices" workshop slides recommend running PhaseByTransmission followed by ReadBackedPhasing --respectPhaseInput.

  1. The --respectPhaseInput option is not currently listed in the documentation. Does that mean that RBP now always respects phasing in the input VCF?

  2. Does (or did) --respectPhaseInput cause phased sites in the input to be assumed correct, or are they just ignored? That is, does RBP --respectPhaseInput use the partial haplotypes from the input file as evidence?

Thanks! Douglas

Hello GATK team,

Our lab is working on a project involving exome sequencing of family trios, and we are interested in determining the parent of origin for the trios. In one of the papers that we have come across, the project team used ReadBackedPhasing first and then applied PhaseByTransmission.

I have read previous posts on the GATK forum and looked at the presentations provided by the GATK team, and found that the suggested approach is to run PhaseByTransmission before ReadBackedPhasing. We are new to these tools, so if you could shed any light on how the tools work when combined, we would really appreciate it.

The step combinations we have already tried are:

a) 1) SelectVariants 2) ReadBackedPhasing 3) PhaseByTransmission

b) 1) SelectVariants 2) PhaseByTransmission 3) ReadBackedPhasing

(EDIT: solution found and explained below, mostly an error on my end, sorry)

I have what I know is a de novo variant (validated) and GATK PhaseByTransmission refuses to see it. Here is what I am starting with in my VCF file:

7 151092903 . G A 338.83 PASS . GT:AD:DP:GQ:PL 0/0:12,0:12:36:0,36,414 0/0:20,0:20:60:0,60,669 0/1:6,15:20:99:389,0,108

So:
- the father is 12 ref, 0 alt
- the mother is 20 ref, 0 alt
- the offspring is 6 ref, 15 alt

When I run

java -Xmx2g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -R fasta/human_g1k_v37.fasta -T PhaseByTransmission --DeNovoPrior 0.00001 -V trio1_1553_1554_1555_small.recode.vcf -ped trio1_1553_1554_1555.ped -o trio1_1553_1554_1555.vcf --MendelianViolationsFile trio1_1553_1554_1555_noMendel.tab

I get the following output VCF line:

7 151092903 . G A 338.83 PASS . GT:AD:DP:GQ:PL:TP 1|0:12,0:12:0:0,36,414:13 0|0:20,0:20:60:0,60,669:13 1|0:6,15:20:99:389,0,108:13

So the father is eventually called a het. This happens even when I set the prior to a low value of 10^-5. That does not seem like the right behavior to me; a more appropriate call would be to call both parents ref homs. The genotype likelihoods certainly suggest that this would make sense for a 10^-5 prior on a de novo event.

EDIT: OK, I wish I could remove this post. I don't think I can but I can edit the answer at least. I was just misreading the genotype likelihood. The evidence in favour of a homozygous call in the father is in fact weaker than I thought. A prior of de novo calls of 5x10^-4 fixes things, and with that threshold I am getting a proper de novo call at this location. I apologize for the pointless post!
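The arithmetic behind that edit can be sketched roughly in Python. PL values are Phred-scaled, so the father's PL of 36 for the het genotype means it is about 10^-3.6 times as likely as hom-ref. The comparison below deliberately ignores the mother's and child's likelihoods and every other term of the real model, so it is a back-of-the-envelope illustration under that simplifying assumption and not the calculation PhaseByTransmission actually performs. The function name is made up for the example.

```python
def pl_to_relative_likelihood(pl):
    """Convert a Phred-scaled likelihood (PL) to a likelihood relative to the best genotype."""
    return 10 ** (-pl / 10)


# Father's PL field from the VCF line in the post: 0,36,414
father_pl = {"0/0": 0, "0/1": 36, "1/1": 414}
father_het_vs_homref = pl_to_relative_likelihood(father_pl["0/1"])
print(f"father het relative to hom-ref: {father_het_vs_homref:.3g}")

# Simplified comparison (an assumption for illustration only): the de novo
# explanation costs roughly the de novo prior, while the "father is really
# het" explanation costs roughly the father's het likelihood ratio.
for de_novo_prior in (1e-5, 5e-4):
    if de_novo_prior > father_het_vs_homref:
        favoured = "de novo in the child"
    else:
        favoured = "father called het"
    print(f"prior {de_novo_prior:g}: favours {favoured}")
```

Under this simplification, a prior of 10^-5 is smaller than the father's het likelihood ratio while 5x10^-4 is larger, which is consistent with the behaviour described in the post.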

Hi again,
I was surprised to notice that the phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) fewer variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT). A quick scan (SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.

Is this the expected behaviour of both tools or am I doing something terribly wrong?
If yes, is there an option to emit these variants together with the (phased and unphased variants) in the file specified with -o (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution)?

Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting the variant-recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times), then running PhaseByTransmission on the combined vcf. The job completed successfully but nothing appears phased (all "/" and no "|") in the output vcf. I also tried running the variant-recalibrated vcf file separately through ReadBackedPhasing. That job completed successfully as well, but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Hi all, I am trying to run PhaseByTransmission on a trio using the merged vcf file with father, mother and child. The vcf only contains PASS variants and no triallelic/multiallelic variants. I am also providing the ped file, which looks correct. However, during the run I encounter this error and really cannot figure out why. Is it because the variant is an indel? I have other indels in the file before this one, though...

ERROR MESSAGE: Error parsing line: 1 181752783 rs798209 TG TTG 151.00 PASS 1KG_AF=.;AC1=1;AC=3;AF1=0.5;AN=6;CQ=intron_variant;DP4=102,16,89,13;DP=384;ENST=ENST00000357570;ESP_AF=.;FQ=127;GN=CACNA1E;HWE=1.000000;ICF=-1.00000;INDEL;MAF=.;MQ=48;PV4=1,1,0.11,0.14;SF=0,1,2;TYPE=ins;Cohort_AF=. GT:GQ:DP:SP:PL 0/1:99:82:0:178,0,162 0/1:99:65:3:195,0,133 0/1:99:73:3:.,.,158,

Any idea? Thanks a lot!

Hello Team,

I am attempting to run GATK's PhaseByTransmission command to phase a vcf file containing a father, mother, son trio generated with the Complete Genomics mkvcf command.

After creating the ped file and running the command, I get the error: "MESSAGE: BUG: Attempted to get likelihoods as strings and neither the vector nor the string is set!". I am not exactly sure what this means.

When I check my file and the documentation I am able to see that the 'GL' field is contained in the file, but could this not be the case? I have attached a few lines from the vcf I am using.

Any help with resolving this issue would be greatly appreciated.

Thank you


In switching to the 2.x series of GATK, I noticed that PBT now drops multi-allelic sites entirely from the output. Shouldn't the correct behavior be to write them out unmodified? Or is there a specific reason multi-allelic sites are not being written out?

Specifically, here is the current code

if (vc == null || !vc.isBiallelic())
    return metricsCounters;

But I think it should be something like this...

if (vc == null)
    return metricsCounters;
if (!vc.isBiallelic()) {
    // multi-allelic site: write it to the output unmodified here, then skip phasing
    return metricsCounters;
}

I'm attempting to use PhaseByTransmission

Hi to all

I have started a variant analysis of exome-seq samples from 4 related family members, in which a pathology seems to be related to a polymorphism. I am wondering which variant calling tool is best to use and whether applying PhaseByTransmission refinement is the correct approach. In the PhaseByTransmission analysis, does the read group that I assigned to the bam file play a role in defining the relationships, or do I have to use just the ped file?


While using the walker PhaseByTransmission I always get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-12-ga99c19d): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: File associated with name java.io.FileReader@5cf7c5b5 is malformed: Bad PED line 1: wrong number of fields
##### ERROR ------------------------------------------------------------------------------------------

My command is:

java -jar GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -T  PhaseByTransmission -R GRCh37.fasta -V trios_457.chr22.vcf -ped trios_457.chr22.ped -pedValidationType SILENT -o o1.vcf

And my ped file looks like this:

fam1    s_4     0       0       1       1       C       C       C       C       G       G
fam1    s_5     0       0       2       2       T       T       T       T       G       G
fam1    s_7     s_4     s_5     2       2       C       T       C       T       G       G

I counted the entries in my vcf, ped and map files and the result is:

-bash-4.1$ head -1 trios_457.chr22.ped |wc -w
1892         #( 6 columns for info + 943*2 columns for alleles )
-bash-4.1$ wc -l trios_457.chr22.map 
-bash-4.1$ grep -v "#" trios_457.chr22.vcf | wc -l

My question is: what's wrong with my PED line?

When I run PhaseByTransmission (using the parameter --MendelianViolationsFile), I find garbled values (like [[I@424ace42], [[I@3d2b710e], [I@6f0b6d81) in the fields "MOTHER_AD", "FATHER_AD" and "CHILD_AD". As for the other two fields, "AC" and "TP", can anyone explain what they mean?


I know that PhaseByTransmission can accept a vcf file containing three samples of a trio as its input. I want to know if PhaseByTransmission can also accept three vcf files of a trio as its input?


Hello GATK Team,

There are currently two walkers for phasing in the GATK: PhaseByTransmission and ReadBackedPhasing. Because of their different information sources (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files), these can produce different or complementary genotypes. There used to be a walker for this job, "MergeAndMatchHaplotypes", but it seems to be discontinued.

What is the current recommendation for Trios? Only use PhaseByTransmission?

While using PhaseByTransmission I always get this message:

INFO 23:07:16,945 PhaseByTransmission - Caution: Family F1 has 1 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped.

This is the PED file:

F1 26779 31599 31600 2 2

F2 31599 0 0 2 0

F3 31600 0 0 1 0

And the command line:

java -Xmx2g -jar /home/common/GenomeAnalysisTK-2.1-13/GenomeAnalysisTK.jar \
  -R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
  -T PhaseByTransmission \
  -V 38Ind_batch01_ped_snps.raw.SNP.filtered.vcf \
  -ped familys.ped \
  -pedValidationType SILENT \
  -o 38Ind_batch01_ped_snps.raw.SNP.filtered.phasedBT.vcf

PhaseByTransmission then only rewrites the VCF file without doing any phasing. Is there something wrong with the command line, or is the PED file malformed?
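The log message says family F1 has only one member, so one thing worth checking is how many rows share each family ID in the first PED column. The short Python sketch below does that count for the three PED rows quoted in the post. The function name is hypothetical and the check is only a diagnostic aid, not a statement about how GATK parses the file internally.

```python
from collections import defaultdict


def family_sizes(ped_lines):
    """Count how many individuals are listed under each family ID (column 1)."""
    sizes = defaultdict(int)
    for line in ped_lines:
        fields = line.split()
        if len(fields) >= 6:  # ignore blank or incomplete lines
            sizes[fields[0]] += 1
    return dict(sizes)


# The three rows from the PED file quoted in the post.
ped = [
    "F1 26779 31599 31600 2 2",
    "F2 31599 0 0 2 0",
    "F3 31600 0 0 1 0",
]

for family, size in family_sizes(ped).items():
    print(f"{family}: {size} member(s)")
```

Each of the three rows carries a different family ID here, which matches the "has 1 members" wording in the log. For a trio, all three individuals would normally share one family ID.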

Hi all,

I'd like to know if someone has tested the concordance from output of PhaseByTransmission with SNP array data.

For a family trio, I calculated the genotype concordance against SNP array data using the most likely GT combination (based on the GL values) from the VCF produced by UnifiedGenotyper. I then did the same for the genotypes obtained after running PhaseByTransmission, and I'm seeing a drop in concordance.

Is this to be expected?


Hi all,

Has anyone else gotten the following:

java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.phaseTrioGenotypes(PhaseByTransmission.java:242)
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:306)
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:35)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:225)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:122)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:149)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

My command line was: java -jar GenomeAnalysisTK.jar -T PhaseByTransmission -V w01.sorted.vcf -o w01.phased.vcf -f "mom+dad=child" -R hg19.fa



Dear GATK team,

I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.

Thanks for all the great work you do with the GATK.

Best wishes,


Is it possible to use PhaseByTransmission with families that are larger than a single trio? I have a family with four siblings. If I include all of the siblings in the PED I get:

PhaseByTransmission - Caution: Family BMD has 6 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped.
ERROR MESSAGE: Bad input: No PED file passed or no trios found in PED file. Aborted.

And if I just include the one key trio with the proband, I get the following:

ERROR MESSAGE: Sample BMD006_R found in data sources but not in pedigree files with STRICT pedigree validation

There does not seem to be an accessible argument for relaxing the pedigree validation. Is there a way to use PhaseByTransmission with my larger family?
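The STRICT validation error above names a sample that is in the VCF but not in the PED file. A quick way to list all such samples before running the tool is sketched below in Python. The function names and every sample ID other than BMD006_R are hypothetical, and the header line is assumed to be tab-separated with samples starting at the tenth column, as in a standard VCF.

```python
def vcf_samples(header_line):
    """Return the sample names: the columns after FORMAT on the #CHROM header line."""
    columns = header_line.rstrip("\n").split("\t")
    return columns[9:]


def samples_missing_from_ped(header_line, ped_lines):
    """List VCF samples whose names do not appear as an Individual ID (PED column 2)."""
    ped_ids = set()
    for line in ped_lines:
        fields = line.split()
        if len(fields) >= 2:
            ped_ids.add(fields[1])
    return [name for name in vcf_samples(header_line) if name not in ped_ids]


# Hypothetical header and PED rows; only BMD006_R comes from the error message.
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "BMD001", "BMD002", "BMD003", "BMD006_R"]
)
ped = [
    "BMD BMD003 BMD001 BMD002 1 2",
    "BMD BMD001 0 0 1 1",
    "BMD BMD002 0 0 2 1",
]

print(samples_missing_from_ped(header, ped))
```

Other posts in this thread pass `-pedValidationType SILENT` on the command line; whether that option is available in the version used here is not something this sketch can confirm.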