# Tagged with #readbackedphasing 2 documentation articles | 1 announcement | 13 forum discussions

Created 2012-07-23 23:56:34 | Updated 2012-07-23 23:56:34 | Tags: gatkdocs readbackedphasing

A new tool has been released!

Check out the documentation at ReadBackedPhasing.

Created 2012-07-23 17:15:57 | Updated 2015-05-15 21:31:15 | Tags: readbackedphasing phasing

This document describes the underlying concepts of physical phasing as applied in the ReadBackedPhasing tool. For a complete, detailed argument reference, refer to the tool documentation page.

Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in -ERC GVCF or -ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary unless you want to merge consecutive variants into MNPs.

### Underlying concepts

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.

### How it works

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1
chr20   332341  rs6076509   A   G   470.60  PASS    AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57    GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99
chr20   332503  rs6133033   C   G   726.23  PASS    AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75    GT:DP:GL:GQ:PQ  1|0:60:-110.83,-18.08,-149.73:99:126.93


The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. For cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Note that this tool can only handle diploid data properly. If your organism of interest is polyploid or if you are working with data from pooling experiments, you should not run this tool on your data.

### More detailed aspects of semantics of phasing in the VCF format

• The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.

• Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.

• A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.

• RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).

• If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   SAMP2
chr1    1   .   A   G   99  PASS    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    2   .   A   G   99  PASS    .   GT:GL:GQ:PQ 1|1:-100,0,-100:99:60   0|1:-100,0,-100:99:50
chr1    3   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:60   0|0:-100,0,-100:99:60
chr1    4   .   A   G   99  FAIL    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    5   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:70   1|0:-100,0,-100:99:60
chr1    6   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0/1:-100,0,-100:99  1|1:-100,0,-100:99:70
chr1    7   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:80   0|1:-100,0,-100:99:70
chr1    8   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:90   0|1:-100,0,-100:99:80


The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

AGAAA
GGGAG


And two haplotypes at positions 6-8:

AAA
GGG


And, SAMP2 has the two haplotypes at positions 1-8:

AAAAGGAA
GGAAAGGG


Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.

• This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

## Haplotype Caller

• Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
• Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
• Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
• Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
• Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

## CombineGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

## GenotypeGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

• This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

## Unified Genotyper

• Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

## Variant Recalibrator

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

• Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

## Variant Annotator

• Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
• Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
• In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

## Combine Variants

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

## Select Variants

• Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
• Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

## CalculateGenotypePosteriors

• Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
• It no longer strips out the AD annotation from genotypes.
• AC/AF/AN counts are updated after fixing genotypes.
• Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

## Indel Realigner

• Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

• Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

## Miscellaneous

• Thanks to Phillip Dexheimer for several Queue related fixes and patches.
• Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
• Providing an empty BAM list no results in a user error.
• Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
• Logger output now goes to STDERR instead of STDOUT.
• Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2015-08-18 10:30:39 | Updated | Tags: readbackedphasing targeted-exome-sequencing genotype-phasing

Hello, I perform variant calling on Amplicon Based Targeted Exome sequencing on Human. I use a licenced version of GATK (V2014.3-3.2.2-7-g f9cba99). I am using ReadBackedPhasing for obtaining the Phasing information. After a successful run of the above tool, it displays that Sites phased are 41. (Screenshot attached) However, in the output file I don't see any variant with a phased genotype. The ReadBackedPhasing specific arguments provided are as given below:

java -jar ~/software_stack/GenomeAnalysisTK/GenomeAnalysisTK.jar -T ReadBackedPhasing -R ~/reference_stack/reference/human.fasta -I ~/dummydir/out.bam -V ~/dummydir/Sample.vcf -o ~/dummydir/Sample_Phased.vcf

I would also like to know what would be the ideal horizontal as well as vertical coverage (as per the interval file) for Amplicon based Targeted Exome sequencing to carry out a successful Genotype phasing.

Thanks & Regards, Aswathy

Created 2015-03-24 19:04:52 | Updated | Tags: phasebytransmission readbackedphasing phasing

The 2013 "best practices" workshop slides recommend running PhaseByTransmission followed by ReadBackedPhasing --respectPhaseInput.

1. The --respectPhaseInput option is not currently listed in the documentation. Does that mean that RBP now always respects phasing in the input VCF?

2. Does (or did) --respectPhaseInput cause phased sites in the input to be assumed correct, or are they just ignored? That is, does RBP --respectPhaseInput use the partial haplotypes from the input file as evidence?

Thanks! Douglas

Created 2015-01-30 11:27:05 | Updated | Tags: haplotypecaller readbackedphasing phasing allele-counts

I'm a bit confused regarding the new GATK version and new HC-functions. I'm trying to call haplotypes in a family of plants. I call Haplotypes using haplotype caller, then I want to run Read-backed phasing on the raw vcfs and then CalculateGenotypePosterios to add pedigree information. The CalculateGenotypePosterios-Walker seems to need the format Fields AC and AN, but they are not produced by the HaplotypeCaller. They used to be in earlier HC-Versions though...(?). How can I fix this? And is this a proper workflow at all? Is Read-backed phasing needed or has it become redundant with the new HC-Version being able to do physical phasing. Would it be "enough" to run HC for phasing and CalculateGenotypePosterios to add pedigree information? Anyhow the problem of missing ac and an fields remains. I would be greatful for some help on this.

Thsi is how a raw vcf produced by HC looks like

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GfGa4742

GSVIVT01012145001 1 . G . . END=113 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 GSVIVT01012145001 114 . C . . END=164 GT:DP:GQ:MIN_DP:PL 0/0:172:99:164:0,120,1800 GSVIVT01012145001 165 . T C, 7732.77 . DP=175;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,173,0:173:99:0|1:165_T_C:7761,521,0,7761,521,7761:0,0,165,8 GSVIVT01012145001 166 . G . . END=166 GT:DP:GQ:MIN_DP:PL 0/0:174:72:174:0,72,1080 GSVIVT01012145001 167 . T . . END=175 GT:DP:GQ:MIN_DP:PL 0/0:174:66:174:0,60,900 GSVIVT01012145001 176 . T . . END=191 GT:DP:GQ:MIN_DP:PL 0/0:174:57:173:0,57,855 GSVIVT01012145001 192 . A . . END=194 GT:DP:GQ:MIN_DP:PL 0/0:173:54:173:0,54,810 GSVIVT01012145001 195 . T . . END=199 GT:DP:GQ:MIN_DP:PL 0/0:174:51:173:0,51,765

And this is the Error Message I get

##### ERROR MESSAGE: Key AC found in VariantContext field INFO at GSVIVT01012145001:1 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

Created 2015-01-16 18:19:55 | Updated 2015-01-19 21:23:58 | Tags: phasebytransmission readbackedphasing

Hello GATK team,

Our lab is working on project involving exome sequencing for family trios and we were interested in determining the parent of origin for the trios. In one of the papers that we have come across, we found that the project team used ReadBackedPhasing first and then applied PhasedByTransmission.

I have read previous post on GATK forum and looked at the presentations which are provided by GATK team and found that the analysis is suggested to be done in a way where PhasedByTransmission step is done before ReadBackedPhasing. We are new to these tools, so if you could shed any light on how the tools work when combined, we would really appreciate it.

The step combinations which we have already have tried out are:

Created 2014-04-09 22:57:16 | Updated | Tags: readbackedphasing

Hello,

I am working with v3.1-1 and cannot get ReadBackPhasing to produce phased genotypes (i.e., 0|0, 0|1, 1|1) on a multi-sample vcf. Here is my command line:

$java -Xmx6g -jar /share/apps/gatk/3.1-1/GenomeAnalysisTK.jar \ -T ReadBackedPhasing \ -R${REF} \ -I ${INBAM} \ --variant${INVCF} \ -L ${INVCF} \ -o${OUTVCF} \ --phaseQualityThresh 20.0

When I use and older version (v2.6-4), the same commandline yields phased genotypes on the same input vcf.

The logging information indicates that many sites were successfully phased for each sample but \$ grep "|" <phased.vcf> yields nothing.

It looks like a bug in v3.1-1 or am I missing something?

Jonathan

Created 2013-08-20 08:07:15 | Updated | Tags: phasebytransmission readbackedphasing

Hi again,
I was surprised to notice that my phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) less variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT). A quick scan (SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.

Is this the expected behaviour of both tools or am I doing something terribly wrong?
If yes, is there an option to emit these variants together with the (phased and unphased variants) in the file specified with -o (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution)?

Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
K
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT

Created 2013-08-19 12:09:43 | Updated 2013-08-19 12:11:00 | Tags: haplotypecaller readbackedphasing

Hi, I'm calling Haplotypes of a diploid genome with the HaplotypeCaller and using the Read-Backed Phasing tool afterwards. I believe that Indels can't be phased so far. My output files look like the example showed below. In the example below there is an Indel at Position 795. As you can see, the alternative alleles in position 779 and 787 belong to the same haplotype. Do the alternative alleles in position 799 and 805 also belong to that haplotype or do I only have the phasing information for regions between the Indels and have to find another method to get a continuous sequence for each haplotype? Can I also assume that that haplotype which has the "C" in position 779 and "T" in position 787 has a Deletion (of AAA) in position 795?

Created 2013-07-23 16:20:21 | Updated | Tags: phasebytransmission readbackedphasing trio genotype

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Hi all, Just to give some context: I have filtered my trio data with some scripting to only heterozygous (hets) variants that may constitute compound hets (i.e., if phase could be accurately inferred). This is essentially phasing the child data by transmission - for all the het variants seen in the child I looked at the father and mother vcfs and filtered relevant sites as follows: - each het variant in child has to be in only and exactly one of the parents, so this excludes 1) hets present in both parents (these cannot be resolved) and 2) hets not present in any parent (not interested on those as I only want to analyse compound hets); - selected genes with at least two of the above vars; - selected genes with at least one het transmitted from the paternal side and one het from the maternal side.

My question is: can I use this filtered child vcf as my input for ReadBackedPhasing? For each of my genes that feature in the child vcf after the above filtering, I want to determine whether the variants seen within the gene are in the same haplotype or not. I am just not sure if I can do the phasing at this stage - is this alright? If I had to do the phasing early on with the raw vcf, I am not sure how would I maintain the correct phasing information when applying this filtering downstream to the phased vcf (i.e., as the phasing of a het variant is relevant to the previous PASS-ing het variant in the vcf?).

Help would be appreciated! Thanks a lot, Eva

Created 2013-03-29 20:27:26 | Updated | Tags: readbackedphasing bam

I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?

Created 2012-10-30 08:56:45 | Updated | Tags: phasebytransmission readbackedphasing mergeandmatchhaplotypes combining walker

Hello GATK Team,

there are currently two walkers for phasing in the GATK PhaseByTransmission and ReadBackedPhasing. Because of their different information source (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files) these can produce different or complementary genotypes. There used to be a walker for this job "MergeAndMatchHaplotypes" but it seems to be discontinued.

What is the current recommendation for Trios? Only use PhaseByTransmission?

Created 2012-08-23 13:24:57 | Updated 2012-08-23 13:24:57 | Tags: phasebytransmission readbackedphasing

Dear GATK team,

I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.

Thanks for all the great work you do with the GATK.

Best wishes,

Katherine

Created 2012-08-15 03:19:56 | Updated 2012-08-15 03:19:56 | Tags: readbackedphasing

Hi developers, Could you share the detailed ReadBackedPhasing algoritm documents with me? As I don't find much more about the algorithm except GATK website, but i have some questions need to read the algorithm carefully and totally understand ReadBackedPhasing.

THanks very much. Yujian