Tagged with #readbackedphasing
1 documentation article | 1 announcement | 20 forum discussions

Created 2012-07-23 17:15:57 | Updated 2015-05-15 21:31:15 | Tags: readbackedphasing phasing
Comments (35)

This document describes the underlying concepts of physical phasing as applied in the ReadBackedPhasing tool. For a complete, detailed argument reference, refer to the tool documentation page.

Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in -ERC GVCF or -ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary unless you want to merge consecutive variants into MNPs.

Underlying concepts

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.

How it works

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):

chr20   332341  rs6076509   A   G   470.60  PASS    AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57    GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99
chr20   332503  rs6133033   C   G   726.23  PASS    AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75    GT:DP:GL:GQ:PQ  1|0:60:-110.83,-18.08,-149.73:99:126.93

The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. For cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Note that this tool can only handle diploid data properly. If your organism of interest is polyploid or if you are working with data from pooling experiments, you should not run this tool on your data.

More detailed aspects of semantics of phasing in the VCF format

  • The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.

  • Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.

  • A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.

  • RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).

  • If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

chr1    1   .   A   G   99  PASS    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    2   .   A   G   99  PASS    .   GT:GL:GQ:PQ 1|1:-100,0,-100:99:60   0|1:-100,0,-100:99:50
chr1    3   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:60   0|0:-100,0,-100:99:60
chr1    4   .   A   G   99  FAIL    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    5   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:70   1|0:-100,0,-100:99:60
chr1    6   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0/1:-100,0,-100:99  1|1:-100,0,-100:99:70
chr1    7   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:80   0|1:-100,0,-100:99:70
chr1    8   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:90   0|1:-100,0,-100:99:80

The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:


And two haplotypes at positions 6-8:


And, SAMP2 has the two haplotypes at positions 1-8:


Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.

Created 2014-03-04 06:37:01 | Updated 2014-03-17 22:48:07 | Tags: readbackedphasing reducereads release-notes genotypegvcfs combinegvcfs splitncigarreads
Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.


  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.


  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.


  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.


  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.


  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list no results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2016-01-14 04:05:41 | Updated | Tags: readbackedphasing snp phasing mnp
Comments (7)

Can the tool also work on MNP other than SNP? If not, is there any plan to expand?

Created 2015-12-27 00:37:35 | Updated | Tags: readbackedphasingwalker readbackedphasing vcf-format
Comments (2)

I have been trying to use ReadBackedPhasing, but its is producing an output with no phasing tags (HP tag). It works with a vcf file that is been called on a single bam file, but does not produce any output from vcf files generated from vcf-subset. In short, I would like to know what INFO/FORMAT fields it looks for while assigning HP tag.

Created 2015-12-15 23:23:57 | Updated | Tags: readbackedphasing vcf bam htsjdk
Comments (1)

I got a Contig X does not have a length field error when I run the phasing tool. By digging in more, it seems a user exception from htsjdk code, here is the trace of the coding line, has anyone experienced this issue before? Is it a problem in contig reference sequence or bad VCF file, or bad bam file?

Thanks for any help,

--------------------------------error exception raised in following code line-------------------------------------------------------------------

public SAMSequenceRecord getSAMSequenceRecord() { final String lengthString = this.getGenericFieldValue("length"); if (lengthString == null) throw new TribbleException("Contig " + this.getID() + " does not have a length field.");

which is called by RMDTrackBuilder.java validateVariantAgainstSequenceDictionary(final String name, final String descriptorName, final AbstractFeatureReader reader, final SAMSequenceDictionary dict )

in vcfContigRecords.add(contig.getSAMSequenceRecord()); -- crashed here

Created 2015-12-10 17:04:52 | Updated | Tags: readbackedphasing
Comments (34)

Run ReadBackedPhasing to have following error, anyone has any suggestions?

SamDataSource$SAMReaders - Done initializing. BAM readers: total time 0.01 RMDTrackBuilder - Writing Tribble index to disk for file ...x.vcf.idx

A USER error has occurred

ERROR Message: I/O error loading or writing tribble indexc file for x.vcf

Created 2015-11-26 10:31:36 | Updated 2015-11-26 10:32:29 | Tags: haplotypecaller readbackedphasing genotype phasing
Comments (1)

Hi, I know that HaplotypeCaller outputs physical phasing information in PG and PGT fields, so as not to confuse physical phasing with pedigree based methods, however if I wanted to use the physical phasing information in downstream tools which rely on the | character in the GT tag and the PS identifier tag, is it sufficient to copy this information from the PG and PGT fields into the GT and PS tags respectively for all samples which were genotyped (I notice that the genotype in PGT and GT may not necessarily agree after you run genotypeGVCFs - it looks like genotypeGVCFs updates the called GT from HaplotypeCaller, but the physical phasing information from is unchanged)?

Created 2015-11-20 14:59:27 | Updated 2015-11-20 15:00:44 | Tags: readbackedphasing bam
Comments (3)

I want to phase some DNA-seq data.

java -jar GenomeAnalysisTK.jar -T ReadBackedPhasing -R ref.fasta -I readnames.bam --variant test.vcf -L Chr.list -o phased_SNPs.vcf --phaseQualityThresh 20.0

My vcf file looks like this and only contains information for 1 sample








NC_024331.1 131 . G GA . PASS SAF=0.655738;DP=61 NC_024331.1 147 . C G . PASS SAF=0.320000;DP=25 NC_024331.1 422 . C A . PASS SAF=0.414545;DP=275

I previously had an error message saying my bam file did not have read names. I ran java -jar AddOrReplaceReadGroups.jar I=sorted.bam O=readnames.bam RGLB=LaneX RGPU=NONE RGSM=AnySampleName RGPL=illumina

Now I am getting an error

ERROR MESSAGE: No common samples in VCF and BAM headers, so nothing could possibly be phased!

Is there somewhere in the header of the vcf I can add AnySampleName?


Created 2015-11-03 03:39:15 | Updated | Tags: phasebytransmission readbackedphasing
Comments (6)

Hi, Is there anyway to phase all of the child genotypes using both the PhaseByTransmission and ReadBackedPhasing for a given family trio genotype information?

With Kind Regards, -Ruhul

Created 2015-08-18 10:30:39 | Updated | Tags: readbackedphasing phasing
Comments (4)

Hello, I perform variant calling on Amplicon Based Targeted Exome sequencing on Human. I use a licenced version of GATK (V2014.3-3.2.2-7-g f9cba99). I am using ReadBackedPhasing for obtaining the Phasing information. After a successful run of the above tool, it displays that Sites phased are 41. (Screenshot attached) However, in the output file I don't see any variant with a phased genotype. The ReadBackedPhasing specific arguments provided are as given below:

java -jar ~/software_stack/GenomeAnalysisTK/GenomeAnalysisTK.jar -T ReadBackedPhasing -R ~/reference_stack/reference/human.fasta -I ~/dummydir/out.bam -V ~/dummydir/Sample.vcf -o ~/dummydir/Sample_Phased.vcf

I would also like to know what would be the ideal horizontal as well as vertical coverage (as per the interval file) for Amplicon based Targeted Exome sequencing to carry out a successful Genotype phasing.

Thanks & Regards, Aswathy

Created 2015-03-24 19:04:52 | Updated | Tags: phasebytransmission readbackedphasing phasing
Comments (5)

The 2013 "best practices" workshop slides recommend running PhaseByTransmission followed by ReadBackedPhasing --respectPhaseInput.

  1. The --respectPhaseInput option is not currently listed in the documentation. Does that mean that RBP now always respects phasing in the input VCF?

  2. Does (or did) --respectPhaseInput cause phased sites in the input to be assumed correct, or are they just ignored? That is, does RBP --respectPhaseInput use the partial haplotypes from the input file as evidence?

Thanks! Douglas

Created 2015-01-30 11:27:05 | Updated | Tags: haplotypecaller readbackedphasing phasing
Comments (4)

I'm a bit confused regarding the new GATK version and new HC-functions. I'm trying to call haplotypes in a family of plants. I call Haplotypes using haplotype caller, then I want to run Read-backed phasing on the raw vcfs and then CalculateGenotypePosterios to add pedigree information. The CalculateGenotypePosterios-Walker seems to need the format Fields AC and AN, but they are not produced by the HaplotypeCaller. They used to be in earlier HC-Versions though...(?). How can I fix this? And is this a proper workflow at all? Is Read-backed phasing needed or has it become redundant with the new HC-Version being able to do physical phasing. Would it be "enough" to run HC for phasing and CalculateGenotypePosterios to add pedigree information? Anyhow the problem of missing ac and an fields remains. I would be greatful for some help on this.

Thsi is how a raw vcf produced by HC looks like













GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Fri Jan 30 12:04:00 CET 2015",Epoch=1422615840668,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/prj/gf-grape/project_FTC_in_crops/members/Nadia/test/GfGa4742_CGATGT_vs_candidategenes.sorted.readgroups.deduplicated.realigned.recalibrated.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/prj/gf-grape/project_FTC_in_crops/members/Nadia/amplicons_run3/GATK_new/RefSequences_all_candidate_genes.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=true bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=2 mergeVariantsViaLD=false doNotRunPhysicalPhasing=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=1000 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">















































































































GSVIVT01012145001 1 . G . . END=113 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 GSVIVT01012145001 114 . C . . END=164 GT:DP:GQ:MIN_DP:PL 0/0:172:99:164:0,120,1800 GSVIVT01012145001 165 . T C, 7732.77 . DP=175;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,173,0:173:99:0|1:165_T_C:7761,521,0,7761,521,7761:0,0,165,8 GSVIVT01012145001 166 . G . . END=166 GT:DP:GQ:MIN_DP:PL 0/0:174:72:174:0,72,1080 GSVIVT01012145001 167 . T . . END=175 GT:DP:GQ:MIN_DP:PL 0/0:174:66:174:0,60,900 GSVIVT01012145001 176 . T . . END=191 GT:DP:GQ:MIN_DP:PL 0/0:174:57:173:0,57,855 GSVIVT01012145001 192 . A . . END=194 GT:DP:GQ:MIN_DP:PL 0/0:173:54:173:0,54,810 GSVIVT01012145001 195 . T . . END=199 GT:DP:GQ:MIN_DP:PL 0/0:174:51:173:0,51,765

And this is the Error Message I get

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Key AC found in VariantContext field INFO at GSVIVT01012145001:1 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

Created 2015-01-16 18:19:55 | Updated 2015-01-19 21:23:58 | Tags: phasebytransmission readbackedphasing
Comments (7)

Hello GATK team,

Our lab is working on project involving exome sequencing for family trios and we were interested in determining the parent of origin for the trios. In one of the papers that we have come across, we found that the project team used ReadBackedPhasing first and then applied PhasedByTransmission.

I have read previous post on GATK forum and looked at the presentations which are provided by GATK team and found that the analysis is suggested to be done in a way where PhasedByTransmission step is done before ReadBackedPhasing. We are new to these tools, so if you could shed any light on how the tools work when combined, we would really appreciate it.

The step combinations which we have already have tried out are:

a) 1) SelectVariants 2)ReadBackedPhasing 3)PhasedByTransmission

b) 1) SelectVariants 2)PhasedByTransmission 3)ReadBackedPhasing

Created 2014-04-09 22:57:16 | Updated | Tags: readbackedphasing
Comments (16)


I am working with v3.1-1 and cannot get ReadBackPhasing to produce phased genotypes (i.e., 0|0, 0|1, 1|1) on a multi-sample vcf. Here is my command line:

$ java -Xmx6g -jar /share/apps/gatk/3.1-1/GenomeAnalysisTK.jar \ -T ReadBackedPhasing \ -R ${REF} \ -I ${INBAM} \ --variant ${INVCF} \ -L ${INVCF} \ -o ${OUTVCF} \ --phaseQualityThresh 20.0

When I use and older version (v2.6-4), the same commandline yields phased genotypes on the same input vcf.

The logging information indicates that many sites were successfully phased for each sample but $ grep "|" yields nothing.

It looks like a bug in v3.1-1 or am I missing something?


Created 2013-08-20 08:07:15 | Updated | Tags: phasebytransmission readbackedphasing
Comments (6)

Hi again,
I was surprised to notice that my phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) less variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT). A quick scan (SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.

Is this the expected behaviour of both tools or am I doing something terribly wrong?
If yes, is there an option to emit these variants together with the (phased and unphased variants) in the file specified with -o (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution)?

Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT

Created 2013-08-19 12:09:43 | Updated 2013-08-19 12:11:00 | Tags: haplotypecaller readbackedphasing
Comments (1)

Hi, I'm calling Haplotypes of a diploid genome with the HaplotypeCaller and using the Read-Backed Phasing tool afterwards. I believe that Indels can't be phased so far. My output files look like the example showed below. In the example below there is an Indel at Position 795. As you can see, the alternative alleles in position 779 and 787 belong to the same haplotype. Do the alternative alleles in position 799 and 805 also belong to that haplotype or do I only have the phasing information for regions between the Indels and have to find another method to get a continuous sequence for each haplotype? Can I also assume that that haplotype which has the "C" in position 779 and "T" in position 787 has a Deletion (of AAA) in position 795?

GSVIVT01012049001 779 . A C 5696.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.613;ClippingRankSum=0.538;DP=272;FS=4.894;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.499;QD=20.94;ReadPosRankSum=0.640 GT:AD:DP:GQ:PL:PQ 0|1:121,147:268:99:5725,0,11316:10524.93

GSVIVT01012049001 787 . A T 5671.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.203;ClippingRankSum=0.297;DP=259;FS=1.517;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.090;QD=21.90;ReadPosRankSum=0.815 GT:AD:DP:GQ:PL:PQ 0|1:117,139:256:99:5700,0,11316:10067.97

GSVIVT01012049001 795 . TAAA T 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.922;ClippingRankSum=-0.641;DP=254;FS=1.014;MLEAC=1;MLEAF=0.500;MQ=59.43;MQ0=0;MQRankSum=0.316;QD=27.70;ReadPosRankSum=1.731 GT:AD:DP:GQ:PL 0/1:114,131:245:99:5024,0,15611

GSVIVT01012049001 796 . A T 3273.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.717;ClippingRankSum=0.329;DP=245;FS=1.005;MLEAC=1;MLEAF=0.500;MQ=59.51;MQ0=0;MQRankSum=0.631;QD=13.36;ReadPosRankSum=-1.317 GT:AD:DP:GQ:PL:PQ 1|0:126,111:237:99:3302,0,11019:3862.78

GSVIVT01012049001 799 . A T 3554.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.422;ClippingRankSum=-0.037;DP=244;FS=0.483;MLEAC=1;MLEAF=0.500;MQ=59.40;MQ0=0;MQRankSum=-0.095;QD=14.57;ReadPosRankSum=0.600 GT:AD:DP:GQ:PL:PQ 0|1:111,118:229:99:3583,0,16191:3906.37

GSVIVT01012049001 805 . G A 4959.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.604;ClippingRankSum=1.035;DP=237;FS=2.951;MLEAC=1;MLEAF=0.500;MQ=59.31;MQ0=0;MQRankSum=1.228;QD=20.93;ReadPosRankSum=-0.923 GT:AD:DP:GQ:PL:PQ 0|1:99,134:233:99:4988,0,9612:8723.87

Created 2013-07-23 16:20:21 | Updated | Tags: phasebytransmission readbackedphasing trio genotype
Comments (4)

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Created 2013-07-18 09:44:30 | Updated | Tags: readbackedphasingwalker readbackedphasing pedigree
Comments (3)

Hi all, Just to give some context: I have filtered my trio data with some scripting to only heterozygous (hets) variants that may constitute compound hets (i.e., if phase could be accurately inferred). This is essentially phasing the child data by transmission - for all the het variants seen in the child I looked at the father and mother vcfs and filtered relevant sites as follows:

  • each het variant in child has to be in only and exactly one of the parents, so this excludes 1) hets present in both parents (these cannot be resolved) and 2) hets not present in any parent (not interested on those as I only want to analyse compound hets);
  • selected genes with at least two of the above vars;
  • selected genes with at least one het transmitted from the paternal side and one het from the maternal side.

My question is: can I use this filtered child vcf as my input for ReadBackedPhasing? For each of my genes that feature in the child vcf after the above filtering, I want to determine whether the variants seen within the gene are in the same haplotype or not. I am just not sure if I can do the phasing at this stage - is this alright? If I had to do the phasing early on with the raw vcf, I am not sure how would I maintain the correct phasing information when applying this filtering downstream to the phased vcf (i.e., as the phasing of a het variant is relevant to the previous PASS-ing het variant in the vcf?).

Help would be appreciated! Thanks a lot, Eva

Created 2013-03-29 20:27:26 | Updated | Tags: readbackedphasing bam
Comments (4)

I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?

Created 2012-10-30 08:56:45 | Updated | Tags: phasebytransmission readbackedphasing mergeandmatchhaplotypes combining walker
Comments (2)

Hello GATK Team,

there are currently two walkers for phasing in the GATK PhaseByTransmission and ReadBackedPhasing. Because of their different information source (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files) these can produce different or complementary genotypes. There used to be a walker for this job "MergeAndMatchHaplotypes" but it seems to be discontinued.

What is the current recommendation for Trios? Only use PhaseByTransmission?

Created 2012-08-23 13:24:57 | Updated 2012-08-23 13:24:57 | Tags: phasebytransmission readbackedphasing
Comments (1)

Dear GATK team,

I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.

Thanks for all the great work you do with the GATK.

Best wishes,


Created 2012-08-15 03:19:56 | Updated 2012-08-15 03:19:56 | Tags: readbackedphasing
Comments (8)

Hi developers, Could you share the detailed ReadBackedPhasing algoritm documents with me? As I don't find much more about the algorithm except GATK website, but i have some questions need to read the algorithm carefully and totally understand ReadBackedPhasing.

THanks very much. Yujian