Tagged with #readbackedphasing
2 documentation articles | 1 announcement | 9 forum discussions


Comments (0)

This article is part of the Best Practices workflow documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

Once you have identified variant sites in your data and assigned genotypes for each sample at those sites, it can be useful to further refine the genotypes, e.g. by phasing, by imputation, and by identifying Mendelian violations. The GATK includes tools for phasing and identifying Mendelian violations, but does not include imputation tools. However, we do provide conversion tools to interface with BEAGLE, a popular imputation tool developed at Washington University.

Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


SplitNCigarReads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

CombineGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

GenotypeGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

SimulateReadsForVariants

  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

CalculateGenotypePosteriors

  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

Miscellaneous

  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list no results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.
Comments (13)

Hello,

I am working with v3.1-1 and cannot get ReadBackPhasing to produce phased genotypes (i.e., 0|0, 0|1, 1|1) on a multi-sample vcf. Here is my command line:

$ java -Xmx6g -jar /share/apps/gatk/3.1-1/GenomeAnalysisTK.jar \ -T ReadBackedPhasing \ -R ${REF} \ -I ${INBAM} \ --variant ${INVCF} \ -L ${INVCF} \ -o ${OUTVCF} \ --phaseQualityThresh 20.0

When I use and older version (v2.6-4), the same commandline yields phased genotypes on the same input vcf.

The logging information indicates that many sites were successfully phased for each sample but $ grep "|" <phased.vcf> yields nothing.

It looks like a bug in v3.1-1 or am I missing something?

Jonathan

Comments (6)

Hi again,
I was surprised to notice that my phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) less variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT). A quick scan (SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.

Is this the expected behaviour of both tools or am I doing something terribly wrong?
If yes, is there an option to emit these variants together with the (phased and unphased variants) in the file specified with -o (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution)?

Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
K
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT

Comments (1)

Hi, I'm calling Haplotypes of a diploid genome with the HaplotypeCaller and using the Read-Backed Phasing tool afterwards. I believe that Indels can't be phased so far. My output files look like the example showed below. In the example below there is an Indel at Position 795. As you can see, the alternative alleles in position 779 and 787 belong to the same haplotype. Do the alternative alleles in position 799 and 805 also belong to that haplotype or do I only have the phasing information for regions between the Indels and have to find another method to get a continuous sequence for each haplotype? Can I also assume that that haplotype which has the "C" in position 779 and "T" in position 787 has a Deletion (of AAA) in position 795?

GSVIVT01012049001 779 . A C 5696.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.613;ClippingRankSum=0.538;DP=272;FS=4.894;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.499;QD=20.94;ReadPosRankSum=0.640 GT:AD:DP:GQ:PL:PQ 0|1:121,147:268:99:5725,0,11316:10524.93

GSVIVT01012049001 787 . A T 5671.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.203;ClippingRankSum=0.297;DP=259;FS=1.517;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.090;QD=21.90;ReadPosRankSum=0.815 GT:AD:DP:GQ:PL:PQ 0|1:117,139:256:99:5700,0,11316:10067.97

GSVIVT01012049001 795 . TAAA T 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.922;ClippingRankSum=-0.641;DP=254;FS=1.014;MLEAC=1;MLEAF=0.500;MQ=59.43;MQ0=0;MQRankSum=0.316;QD=27.70;ReadPosRankSum=1.731 GT:AD:DP:GQ:PL 0/1:114,131:245:99:5024,0,15611

GSVIVT01012049001 796 . A T 3273.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.717;ClippingRankSum=0.329;DP=245;FS=1.005;MLEAC=1;MLEAF=0.500;MQ=59.51;MQ0=0;MQRankSum=0.631;QD=13.36;ReadPosRankSum=-1.317 GT:AD:DP:GQ:PL:PQ 1|0:126,111:237:99:3302,0,11019:3862.78

GSVIVT01012049001 799 . A T 3554.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.422;ClippingRankSum=-0.037;DP=244;FS=0.483;MLEAC=1;MLEAF=0.500;MQ=59.40;MQ0=0;MQRankSum=-0.095;QD=14.57;ReadPosRankSum=0.600 GT:AD:DP:GQ:PL:PQ 0|1:111,118:229:99:3583,0,16191:3906.37

GSVIVT01012049001 805 . G A 4959.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.604;ClippingRankSum=1.035;DP=237;FS=2.951;MLEAC=1;MLEAF=0.500;MQ=59.31;MQ0=0;MQRankSum=1.228;QD=20.93;ReadPosRankSum=-0.923 GT:AD:DP:GQ:PL:PQ 0|1:99,134:233:99:4988,0,9612:8723.87

Comments (4)

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Comments (3)

Hi all, Just to give some context: I have filtered my trio data with some scripting to only heterozygous (hets) variants that may constitute compound hets (i.e., if phase could be accurately inferred). This is essentially phasing the child data by transmission - for all the het variants seen in the child I looked at the father and mother vcfs and filtered relevant sites as follows: - each het variant in child has to be in only and exactly one of the parents, so this excludes 1) hets present in both parents (these cannot be resolved) and 2) hets not present in any parent (not interested on those as I only want to analyse compound hets); - selected genes with at least two of the above vars; - selected genes with at least one het transmitted from the paternal side and one het from the maternal side.

My question is: can I use this filtered child vcf as my input for ReadBackedPhasing? For each of my genes that feature in the child vcf after the above filtering, I want to determine whether the variants seen within the gene are in the same haplotype or not. I am just not sure if I can do the phasing at this stage - is this alright? If I had to do the phasing early on with the raw vcf, I am not sure how would I maintain the correct phasing information when applying this filtering downstream to the phased vcf (i.e., as the phasing of a het variant is relevant to the previous PASS-ing het variant in the vcf?).

Help would be appreciated! Thanks a lot, Eva

Comments (4)

I have applied PhaseByTransmission on a trio with a ped file and now want to run ReadBackedPhasing. However, each of the trio variant calls were called from a different BAM file (as each was from a different individual). In the ReadBackedPhasing documentation it only mentions using the program with a single bam. Does this mean that I need to merge the bams for each of the three individuals into a single bam? If so, do you have any suggested programs that work well with GATK?

Comments (2)

Hello GATK Team,

there are currently two walkers for phasing in the GATK PhaseByTransmission and ReadBackedPhasing. Because of their different information source (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files) these can produce different or complementary genotypes. There used to be a walker for this job "MergeAndMatchHaplotypes" but it seems to be discontinued.

What is the current recommendation for Trios? Only use PhaseByTransmission?

Comments (1)

Dear GATK team,

I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.

Thanks for all the great work you do with the GATK.

Best wishes,

Katherine

Comments (8)

Hi developers, Could you share the detailed ReadBackedPhasing algoritm documents with me? As I don't find much more about the algorithm except GATK website, but i have some questions need to read the algorithm carefully and totally understand ReadBackedPhasing.

THanks very much. Yujian