Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in
-ERC GVCF or
-ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary unless you want to merge consecutive variants into MNPs.
The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.
The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.
The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 chr20 332341 rs6076509 A G 470.60 PASS AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57 GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99 chr20 332503 rs6133033 C G 726.23 PASS AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75 GT:DP:GL:GQ:PQ 1|0:60:-110.83,-18.08,-149.73:99:126.93
The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. For cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.
Note that this tool can only handle diploid data properly. If your organism of interest is polyploid or if you are working with data from pooling experiments, you should not run this tool on your data.
The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.
Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.
A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.
RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).
For example, consider the following records from the VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 chr1 1 . A G 99 PASS . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 2 . A G 99 PASS . GT:GL:GQ:PQ 1|1:-100,0,-100:99:60 0|1:-100,0,-100:99:50 chr1 3 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:60 0|0:-100,0,-100:99:60 chr1 4 . A G 99 FAIL . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 5 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:70 1|0:-100,0,-100:99:60 chr1 6 . A G 99 PASS . GT:GL:GQ:PQ 0/1:-100,0,-100:99 1|1:-100,0,-100:99:70 chr1 7 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:80 0|1:-100,0,-100:99:70 chr1 8 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:90 0|1:-100,0,-100:99:80
The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:
And two haplotypes at positions 6-8:
And, SAMP2 has the two haplotypes at positions 1-8:
Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.
There are two types of GATK tools that are able to use pedigree (family structure) information:
The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.
If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).
The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
Not a techie person but an end user/reader of VCF. I came across the following data but could not comprehend why genotype 6 is suddenly unphased. Any help?? Thanks a lot.
ps. NOT sure if this is an appropriate question for GATK forum? Please help remove it if it isn't. I do believe
phasebytransmission was run - (and can find out more info about other tech details if/when asked)
#CHROM POS ID REF ALT FILTER INFO FORMAT child father mother 1#chr11 118957735 . G T PASS . GT:AD:DP:GQ:PL:TP 0|0 0|1 0|0 2#chr11 118957930 . T TAA,TA PASS . GT:AD:DP:GQ:PL 0/0 0/0 0/0 3#chr11 118957947 . AAAG A PASS . GT:AD:DP:GQ:PL:TP 0|0 0|0 0|0 4#chr11 118957948 . AAG A PASS . GT:AD:DP:GQ:PL:TP 1|0 0|0 1|1 5#chr11 118957949 . AG A PASS . GT:AD:DP:GQ:PL:TP 1|1 1|0 1|0 6#chr11 118957950 . G A PASS . GT:AD:DP:GQ:PL:TP 0/0 0/0 0/0
I'd be grateful for your help in understanding how GATK add phase information (PID field) in GenotypeGVCFs.
Here's an example of two lines from a VCF I'm getting for running GenotypeGVCFs on a set of samples:
chr1 8864083 . T C 462.27 . AC=4;AF=0.500;AN=8;BaseQRankSum=-4.950e-01;ClippingRankSum=0.406;DP=74;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=0.306;QD=6.42;ReadPosRankSum=1.54;SOR=0.674 GT:AD:DP:GQ:PL 0/1:22,12:34:99:205,0,518 0/1:10,4:14:60:60,0,220 0/1:3,3:6:71:71,0,72 0/1:10,8:18:99:158,0,238 ./.:2,0:2 chr1 8864426 . C T 5302.72 . AC=5;AF=0.500;AN=10;BaseQRankSum=0.815;ClippingRankSum=-6.270e-01;DP=316;FS=7.664;MLEAC=5;MLEAF=0.500;MQ=60.00;MQRankSum=-6.130e-01;QD=17.05;ReadPosRankSum=-1.240e-01;SOR=0.777 GT:AD:DP:GQ:PGT:PID:PL 0/1:24,66:90:99:0|1:8864426_C_T:1596,0,649 0/1:20,62:82:99:.:.:1502,0,536 0/1:24,37:61:99:0|1:8864426_C_T:921,0,647 0/1:22,41:63:99:0|1:8864426_C_T:1014,0,717 0/1:2,13:15:74:0|1:8864426_C_T:304,0,74
What's the interpretation for the phase information of the last sample in chr1:8864426? in position chr1:8864083 its genotype is not reported but in the proceeding position chr1:8864426 it is phased 0|1 WRT position chr1:8864083.
If I subsequently run SelectVariants for that sample, with --excludeNonVariants this phase information is retained but clearly looses context. So another question is if there is a way to change the PID field when running SelectVariants or VariantFiltration in case the position it is phased with is filtered?
Thanks a lot
I would expect the trio phasing to try and match whole reads from the parents into the child to infer whether a variant came from mum or dad (so to work on the BAM file or at least the output of HaplotypeCaller). But it seem that the input is a VCF file?
I know that HaplotypeCaller outputs physical phasing information in
PGT fields, so as not to confuse physical phasing with pedigree based methods, however if I wanted to use the physical phasing information in downstream tools which rely on the
| character in the
GT tag and the
PS identifier tag, is it sufficient to copy this information from the
PGT fields into the
PS tags respectively for all samples which were genotyped (I notice that the genotype in PGT and GT may not necessarily agree after you run
genotypeGVCFs - it looks like
genotypeGVCFs updates the called
GT from HaplotypeCaller, but the physical phasing information from is unchanged)?
Hi, I have genotyped a trio, and then phased them using PhaseByTransmission of GATK Tools. I have then parsed the output vcf file using BCFTools query command. The output file looks like as follows:
20 65288 G/T ./. ./.
20 65900 A|A A|G A|A
20 66720 ./. C/A C/A 20 68749 T|C C|C C|T
20 69094 G/A ./. ./.
The phasing information can be interpreted as Mother|Father. Now, I understand that the Child Phasing can be done by applying Mendelian Laws. But How the phasing of Father or Mother is done here? For example here "20 68749 T|C C|C C|T", every person of the trio is phased; even the father (het) without even knowing grand parents of the child. As I am a new researcher in this field, your clarification will help me a lot to understand phasing.
Thank you very much, -Ruhul
Hello, I perform variant calling on Amplicon Based Targeted Exome sequencing on Human. I use a licenced version of GATK (V2014.3-3.2.2-7-g f9cba99). I am using ReadBackedPhasing for obtaining the Phasing information. After a successful run of the above tool, it displays that Sites phased are 41. (Screenshot attached) However, in the output file I don't see any variant with a phased genotype. The ReadBackedPhasing specific arguments provided are as given below:
java -jar ~/software_stack/GenomeAnalysisTK/GenomeAnalysisTK.jar -T ReadBackedPhasing -R ~/reference_stack/reference/human.fasta -I ~/dummydir/out.bam -V ~/dummydir/Sample.vcf -o ~/dummydir/Sample_Phased.vcf
I would also like to know what would be the ideal horizontal as well as vertical coverage (as per the interval file) for Amplicon based Targeted Exome sequencing to carry out a successful Genotype phasing.
Thanks & Regards, Aswathy
I have been using HaplotypeCaller 3.4 on five hundred cattle genomes. I am wondering how to pass the physical phasing information, now generated by Haplotype Caller in N+1 mode, through GenotypeGVCF applying pedigree and then out to Beagle 4.0 as a vcf.gz for imputation. My goal is to make a phased reference that is as accurate as possible to be used as an imputation resource. hence I would like to exploit physical phasing information.
Do you have an example work flow? It seems that the recommendations for read-backed phasing have changed since haplotype caller 3.3 came up with the N+1 workflow.
The 2013 "best practices" workshop slides recommend running PhaseByTransmission followed by ReadBackedPhasing --respectPhaseInput.
The --respectPhaseInput option is not currently listed in the documentation. Does that mean that RBP now always respects phasing in the input VCF?
I'm a bit confused regarding the new GATK version and new HC-functions. I'm trying to call haplotypes in a family of plants. I call Haplotypes using haplotype caller, then I want to run Read-backed phasing on the raw vcfs and then CalculateGenotypePosterios to add pedigree information. The CalculateGenotypePosterios-Walker seems to need the format Fields AC and AN, but they are not produced by the HaplotypeCaller. They used to be in earlier HC-Versions though...(?). How can I fix this? And is this a proper workflow at all? Is Read-backed phasing needed or has it become redundant with the new HC-Version being able to do physical phasing. Would it be "enough" to run HC for phasing and CalculateGenotypePosterios to add pedigree information? Anyhow the problem of missing ac and an fields remains. I would be greatful for some help on this.
Thsi is how a raw vcf produced by HC looks like
GSVIVT01012145001 1 . G
And this is the Error Message I get
Hello! I was wondering if the HaplotypeScore annotation was restored for HaplotypeCaller in GATK 2.6. Does it have to be called? (It's not included in my vcf file.) Moreover, all of the GT field designations have "/" instead of "|" which according to the following would mean that the results are still unphased:
"GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are: / : genotype unphased | : genotype phased" http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40
Also, is there a more detailed explanation than what's on the HaplotypeScore documentation page? How is the score determined in UnifiedGenotyper? Does the score have anything to do with phasing? Also, how is phasing achieved if only the 10bps surrounding the SNP are examined, regions which likely do not include other SNPs?
I used Beagle to phase my data but for some indels, I have some probleme :
Input vcf :
2 68599872 . ATG A 14.40 PASS AC=1;AC1=1;AF=0.028
Input for beagle created by ProduceBeagleInput:
2:68599872 TG - 1.0000 0.0000 0.0000 ......
Output vcf created by BeagleOutputToVCF:
2 68599872 . ATG . 14.40 BGL_RM_WAS_- AC1=1;AF1=0.02965.....
error message by CombineVariants:
MESSAGE: Badly formed variant context at location 68599872 in contig 2. Reference length must be at most one base shorter than location size
Can you help me?