
### Overview

This document describes the purpose and general principles of the Genotype Refinement workflow. For the mathematical details of the methods involved, please see the Genotype Refinement math documentation. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial.

## 1. Introduction

The core GATK Best Practices workflow has historically focused on variant discovery (that is, the existence of genomic variants in one or more samples in a cohort) and consistently delivers high quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter out genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes.

A few commonly asked questions are:

### What studies can benefit from the Genotype Refinement workflow?

While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.

### What additional data do I need to run the Genotype Refinement workflow?

If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details).

### Is the Genotype Refinement workflow going to change my data? Can I still use my old analysis pipeline?

After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below), GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before.

## 2. The Genotype Refinement workflow

### Input

Begin with recalibrated variants from VQSR at the end of the best practices pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow.

### Step 1: Derive posterior probabilities of genotypes

#### Tool used: CalculateGenotypePosteriors

Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals.
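As a simplified sketch (not the exact GATK implementation), combining the priors with the PLs amounts to adding the Phred-scaled values in log space and renormalizing; the prior vector used here is purely illustrative:

```python
def phred_to_prob(phred):
    """Convert a Phred-scaled value to a probability in [0, 1]."""
    return 10 ** (-phred / 10)

def posteriors_from_pls(pls, prior_phred):
    """Combine Phred-scaled genotype likelihoods (PLs) with Phred-scaled
    priors over {HomRef, Het, HomVar} and return normalized posteriors."""
    unnorm = [phred_to_prob(pl + p) for pl, p in zip(pls, prior_phred)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# A GQ0 call that is ambiguous between HomRef and Het (PL = 0,0,249)
# is pulled toward Het by a strong Het-favoring prior:
posteriors = posteriors_from_pls([0, 0, 249], [20, 0, 20])
print(max(range(3), key=lambda i: posteriors[i]))  # 1, i.e. Het
```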

SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability that a sample being analyzed is also homozygous variant is very high.

For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.

### Step 2: Filter low quality genotypes

#### Tool used: VariantFiltration

After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging those low quality genotypes indicates to researchers that these genotypes may not be suitable for downstream analysis. However, as with VQSR, a filter tag is applied; the data itself is not removed from the VCF.
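The 99% figure follows directly from the Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10). A minimal sketch:

```python
def gq_to_error_prob(gq):
    """Phred-scaled genotype quality -> probability the call is wrong."""
    return 10 ** (-gq / 10)

# GQ 20 corresponds to a 1% chance of error, i.e. 99% confidence:
print(gq_to_error_prob(20))  # 0.01
```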

### Step 3: Annotate possible de novo mutations

#### Tool used: VariantAnnotator

Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion.

### Step 4: Functional annotation of possible biological effects

#### Tool used: SnpEff (non-GATK)

Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. The GATK currently does not feature integration with any functional annotation tool, but SnpEff is a useful utility that operates in a way very similar to the GATK and integrates readily with the GATK VCF output.

## 3. Output annotations

The Genotype Refinement Pipeline adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed.

### Population Priors

New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset.

### Phred-Scaled Posterior Probability

New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better-calibrated estimate of genotype probabilities than the PLs and are recommended for use in further analyses instead of the PLs.

### Genotype Quality

Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs.
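Concretely, GQ is the difference between the two smallest Phred-scaled values in the vector (the most likely genotype is normalized to 0). A minimal sketch, using an illustrative PP vector:

```python
def gq_from_phred_vector(phreds):
    """Genotype quality from a Phred-scaled genotype vector (PL or PP):
    the second-smallest value minus the smallest. GATK additionally caps
    GQ at 99; that cap is omitted here for clarity."""
    smallest, second = sorted(phreds)[:2]
    return second - smallest

print(gq_from_phred_vector([49, 0, 287]))  # 49
```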

### Joint Trio Likelihood

New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as:

$$-10*\log_{10} ( 1-GL_{mother}[\text{Posterior mother GT}] * GL_{father}[\text{Posterior father GT}] * GL_{child}[\text{Posterior child GT}] )$$

where the GLs are the genotype likelihoods in [0, 1] probability space.
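Assuming the usual Phred convention of base-10 logarithms, the calculation can be sketched as follows (the GL values here are illustrative, not taken from real data):

```python
import math

def joint_trio_phred(gl_mother, gl_father, gl_child):
    """Phred-scaled probability that at least one of the three posterior
    genotypes is wrong, given each sample's likelihood (in [0, 1]
    probability space) for its posterior genotype."""
    p_all_correct = gl_mother * gl_father * gl_child
    return -10 * math.log10(1 - p_all_correct)

# Three genotypes, each with probability 0.99 of being correct:
print(round(joint_trio_phred(0.99, 0.99, 0.99)))  # 15
```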

### Joint Trio Posterior

New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as:

$$-10*\log_{10} (1-GP_{mother}[\text{Posterior mother GT}] * GP_{father}[\text{Posterior father GT}] * GP_{child}[\text{Posterior child GT}] )$$

where the GPs are the genotype posteriors in [0, 1] probability space.

### Low Genotype Quality

New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses.

### High and Low Confidence De Novo

New INFO field annotation for sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately.

## 4. Example

Before:

```
1       1226231 rs13306638      G       A       167563.16       PASS    AC=2;AF=0.333;AN=6;…        GT:AD:DP:GQ:PL  0/0:11,0:11:0:0,0,249   0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0
```


After:

```
1       1226231 rs13306638      G       A       167563.16       PASS    AC=3;AF=0.500;AN=6;…PG=0,8,22;…    GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287    0/0:10,0:10:32:2:24:0,24,360:0,32,439   1/1:0,18:18:43:2:24:889,60,0:867,43,0
```


The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child.

The joint likelihood for this trio at this site is 2, indicating that the genotype for one of the samples may have been changed. Specifically, a low JL indicates that the posterior genotype for at least one of the samples was not the most likely genotype as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.)

The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio.

### Input-derived Population Priors

If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant.

### Supporting Population Priors

Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors.

### Family Priors

The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10⁻⁶). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low-GQ HomRef child, and a GQ60 HomRef child and parent can improve the GQ of the second parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case.
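As a toy illustration of these priors (a simplified model, not GATK's actual implementation; the penalty value and function names are for demonstration only), each pair of parental genotypes implies Mendelian transmission probabilities for the child, and configurations that violate them are penalized by a de novo mutation probability:

```python
# Probability that a parent transmits the alt allele, by parent genotype
# (0 = HomRef, 1 = Het, 2 = HomVar), ignoring mutation.
TRANSMIT_ALT = {0: 0.0, 1: 0.5, 2: 1.0}

DE_NOVO_PRIOR = 1e-6  # illustrative penalty for a Mendelian violation

def child_genotype_prior(mother, father, child):
    """Mendelian prior for the child's genotype given parental genotypes.
    All genotypes are given as alt-allele counts (0, 1, or 2)."""
    pm, pf = TRANSMIT_ALT[mother], TRANSMIT_ALT[father]
    priors = {
        0: (1 - pm) * (1 - pf),
        1: pm * (1 - pf) + (1 - pm) * pf,
        2: pm * pf,
    }
    p = priors[child]
    return p if p > 0 else DE_NOVO_PRIOR  # violation -> de novo penalty

# HomRef x HomVar parents: a Het child is certain under Mendelian inheritance,
print(child_genotype_prior(0, 2, 1))  # 1.0
# while a HomRef child would be a Mendelian violation:
print(child_genotype_prior(0, 2, 0))  # 1e-06
```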

### Caveats

Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.

There are two types of GATK tools that are able to use pedigree (family structure) information:

### Tools that require a pedigree to operate

PhaseByTransmission and CalculateGenotypePosteriors will not run without a properly formatted pedigree file. These tools are part of the Genotype Refinement workflow, which is documented here.

### Tools that are able to generate standard variant annotations

The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.

If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g. to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns of the PLINK PED format, with no allele columns, like a FAM file in PLINK.
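For example, a trio could be described like this (columns: family ID, individual ID, paternal ID, maternal ID, sex, phenotype; tab-separated, with 0 meaning "unknown"; all IDs here are made up):

```
FAM001	father01	0	0	1	1
FAM001	mother01	0	0	2	1
FAM001	child01	father01	mother01	1	2
```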


Hi all, I am trying to run PhaseByTransmission on a trio using the merged VCF file with father, mother and child. The VCF only contains PASS variants and no triallelic/multiallelic variants. I am also providing the PED file, and it looks like that one is correct. However, during the run I encounter this error and really cannot figure out why that is. Is this because it is an indel variant? I have other indels in the file prior to this one though...

```
ERROR MESSAGE: Error parsing line: 1 181752783 rs798209 TG TTG 151.00 PASS 1KG_AF=.;AC1=1;AC=3;AF1=0.5;AN=6;CQ=intron_variant;DP4=102,16,89,13;DP=384;ENST=ENST00000357570;ESP_AF=.;FQ=127;GN=CACNA1E;HWE=1.000000;ICF=-1.00000;INDEL;MAF=.;MQ=48;PV4=1,1,0.11,0.14;SF=0,1,2;TYPE=ins;Cohort_AF=. GT:GQ:DP:SP:PL 0/1:99:82:0:178,0,162 0/1:99:65:3:195,0,133 0/1:99:73:3:.,.,158,
```

Any idea? Thanks a lot!

Hi,

Are there any GATK tools that can be used to check that samples in a pedigree are not mislabelled by looking for Mendelian inconsistencies in the exome sequencing data?

Thanks

Kath

Hi all, just to give some context: I have filtered my trio data with some scripting down to only heterozygous (het) variants that may constitute compound hets (i.e., if phase could be accurately inferred). This is essentially phasing the child data by transmission. For all the het variants seen in the child, I looked at the father and mother VCFs and filtered relevant sites as follows:

- each het variant in the child has to be in only and exactly one of the parents, which excludes 1) hets present in both parents (these cannot be resolved) and 2) hets not present in either parent (not interesting here, as I only want to analyse compound hets);
- selected genes with at least two of the above variants;
- selected genes with at least one het transmitted from the paternal side and one het from the maternal side.

My question is: can I use this filtered child vcf as my input for ReadBackedPhasing? For each of my genes that feature in the child vcf after the above filtering, I want to determine whether the variants seen within the gene are in the same haplotype or not. I am just not sure if I can do the phasing at this stage - is this alright? If I had to do the phasing early on with the raw vcf, I am not sure how would I maintain the correct phasing information when applying this filtering downstream to the phased vcf (i.e., as the phasing of a het variant is relevant to the previous PASS-ing het variant in the vcf?).

Help would be appreciated! Thanks a lot, Eva

Hi,

I am trying to use PED files when I run UnifiedGenotyper, VariantAnnotator and VariantRecalibrator (The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14).

If I don't add the PED file, I get an InbreedingCoeff annotation:

```
annotation=[SnpEff, AlleleBalance, BaseCounts, GCContent, HardyWeinberg, IndelType, AlleleBalanceBySample, MappingQualityZeroBySample]

1       14653   .       C       T       22.54   LowQual ABHet=0.842;ABHom=0.803;AC=4;AF=0.182;AN=22;BaseCounts=1,1362,0,261;BaseQRankSum=-4.292;DP=1075;Dels=0.00;FS=4.692;GC=58.42;HW=2.5;HaplotypeScore=0.0000;InbreedingCoeff=0.0530;MLEAC=3;MLEAF=0.136;MQ=6.41;MQ0=967;MQRankSum=-2.246;OND=0.164;QD=0.08;ReadPosRankSum=-0.008;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=processed_transcript;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000456328;set=FilteredInAll      GT:AB:AD:DP:GQ:MQ0:PL
```


But when I add the PED file, I get nothing:

```
annotation=[AlleleBalance, BaseCounts, GCContent, HardyWeinberg, IndelType, AlleleBalanceBySample, MappingQualityZeroBySample, InbreedingCoeff] pedigree=[/scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped] pedigreeString=[] pedigreeValidationType=SILENT

1       14653   .       C       T       10.43   LowQual ABHom=0.792;AC=2;AF=0.091;AN=22;BaseCounts=1,1362,0,261;BaseQRankSum=-3.828;DP=1075;Dels=0.00;FS=2.373;GC=58.42;HW=10.2;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.045;MQ=6.20;MQ0=976;MQRankSum=-2.768;OND=0.190;QD=0.10;
```


And my PED file seems to be in a bad format, because I get this message:

```
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,687 PedReader - Reading PED file /scratch/cbrc/data/release_2012_Nov/BostonHF/PED/BostonHF.ped with missing fields: []
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,770 PedReader - Phenotype is other? false
INFO  17:02:17,775 PedReader - Phenotype is other? false
```


But I have 6 columns, with tabs between each column:

```
2 MN932002 0 0 2 1
2 PNB32015 0 MN932002 1 1
4 CS934001 0 0 2 2
4 RMAO8004 0 CS934001 1 2
6 KH948045 0 0 2 2
6 AHAO8001 0 0 2 2
6 MHAO8003 0 0 2 2
A LSB32012 0 0 2 2
A SBB32010 0 LSB32012 2 2
B CMSB32011 0 0 2 2
C MKB32014 0 0 2 2
Z M88BBTPL 0 0 other -9
```

For the phenotype: -9 = missing, 1 = unaffected, 2 = affected.

Consequently, I cannot run " VariantRecalibrator with -an InbreedingCoeff". Could you help me ?

Thanks,

Tiphaine

Hi to all

I have just started using GATK and I have a few questions about some tools and about the general workflow.

I have exome-seq data from a trio (3 samples) and I have to detect rare or private variants that segregate with the disease.

From the 3 aligned BAM files I proceeded with the GATK pipeline (AddGroupInfo, MarkDup, Realign, BQSR, UnifiedGenotyper and variant filtration) and I generated 3 VCF files.

Now that I have to use the PhaseByTransmission tool, should I merge the 3 VCF files?

Or would it have been better to merge the BAM files after adding the group info and proceed with the rest of the analysis?

And should I create my .ped file (I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couldn't understand how the ped file is generated) based on the read groups that I have assigned?

Thanks!!!

Hello, all

While using the PhaseByTransmission walker I always get this error:

```
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-12-ga99c19d):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: File associated with name java.io.FileReader@5cf7c5b5 is malformed: Bad PED line 1: wrong number of fields
##### ERROR ------------------------------------------------------------------------------------------
```


My command is:

```
java -jar GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -T PhaseByTransmission -R GRCh37.fasta -V trios_457.chr22.vcf -ped trios_457.chr22.ped -pedValidationType SILENT -o o1.vcf
```


and my ped file is like this:

```
fam1    s_4     0       0       1       1       C       C       C       C       G       G
fam1    s_5     0       0       2       2       T       T       T       T       G       G
fam1    s_7     s_4     s_5     2       2       C       T       C       T       G       G
```


I did count my VCF, PED and MAP files, and the result is:

```
-bash-4.1$ head -1 trios_457.chr22.ped | wc -w
1892    # 6 columns for info + 943*2 columns for alleles
-bash-4.1$ wc -l trios_457.chr22.map
943
-bash-4.1$ grep -v "#" trios_457.chr22.vcf | wc -l
943
```


My question is: what's wrong with my PED line?

There used to be a webpage on how to convert PLINK PED format to VCF format, but it seems that link has disappeared.

Thank you very much in advance.

Is it possible to use PhaseByTransmission with families that are larger than a single trio? I have a family with four siblings. If I include all of the siblings in the PED I get:

```
PhaseByTransmission - Caution: Family BMD has 6 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped.
ERROR MESSAGE: Bad input: No PED file passed or no trios found in PED file. Aborted.
```


And if I just include the one key trio with the proband, I get the following:

```
ERROR MESSAGE: Sample BMD006_R found in data sources but not in pedigree files with STRICT pedigree validation
```


There does not seem to be an accessible argument for relaxing the pedigree validation. Is there a way to use PhaseByTransmission with my larger family?