To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.
But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.
At the moment there are two of the standard annotations affected by pedigree:
In the specific case of trios, an additional GATK walker, PhaseByTransmission, should be used to obtain trio-aware genotypes as well as phase by descent.
The annotations mentioned above have been adapted for PED files starting with GATK v.1.6. If you already have VCF files generated by an older version of the GATK or have not passed a PED file while running the UnifiedGenotyper or VariantAnnotator, you should do the following:
Pass -G StandardAnnotation to VariantAnnotator. Make sure you pass your PED file to the VariantAnnotator as well!
The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
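As a minimal sketch of that requirement, assuming a whitespace-delimited PLINK PED file, the following Python keeps only the first six columns of each line and drops the allele columns. The names `trim_ped_line` and `trim_ped_file` are hypothetical helpers for illustration, not part of the GATK or PLINK, and joining the fields with single spaces is an assumption.

```python
def trim_ped_line(line, n_fields=6):
    """Return the first n_fields whitespace-separated fields of a PED line."""
    fields = line.split()
    if len(fields) < n_fields:
        raise ValueError(f"expected at least {n_fields} fields, got {len(fields)}")
    # Fields are re-joined with single spaces (an assumption of this sketch).
    return " ".join(fields[:n_fields])


def trim_ped_file(in_path, out_path):
    """Write a copy of a PLINK PED file with the genotype columns removed."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(trim_ped_line(line) + "\n")
```

For example, `trim_ped_line("fam1 s_7 s_4 s_5 2 2 C T C T G G")` returns `"fam1 s_7 s_4 s_5 2 2"`.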
Dear GATK team,
I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so I can't answer my questions by looking at the source code.
Thanks for all the great work you do with the GATK.
Best wishes,
Katherine
Hello GATK Team,
There are currently two walkers for phasing in the GATK: PhaseByTransmission and ReadBackedPhasing. Because of their different information sources (PhaseByTransmission has the called VCF file, ReadBackedPhasing the BAM files), these can produce different or complementary genotypes. There used to be a walker for this job, "MergeAndMatchHaplotypes", but it seems to be discontinued.
What is the current recommendation for Trios? Only use PhaseByTransmission?
Hi all,
I'd like to know if someone has tested the concordance of PhaseByTransmission output with SNP array data.
For a family trio, I calculated the genotype concordance against SNP array data for the most likely GT combination (based on the GL values) in the VCF obtained from the UnifiedGenotyper. I then did the same for the genotypes obtained after running PhaseByTransmission, and I'm seeing a drop in concordance.
Is this to be expected?
Thanks!
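For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of a genotype concordance calculation. It assumes that genotypes are available as VCF-style GT strings keyed by site, and that phase should be ignored when comparing against array data (so `0|1` and `1/0` count as the same genotype). The function names and the dictionary layout are hypothetical, chosen only for illustration.

```python
import re


def unordered_genotype(gt):
    """Turn a GT string such as '0/1' or '1|0' into a sorted allele tuple, ignoring phase."""
    return tuple(sorted(re.split(r"[/|]", gt)))


def concordance(calls, truth):
    """Fraction of sites present in both dicts whose genotypes match regardless of phase."""
    shared = [site for site in calls if site in truth]
    if not shared:
        raise ValueError("no shared sites to compare")
    matches = sum(
        unordered_genotype(calls[site]) == unordered_genotype(truth[site])
        for site in shared
    )
    return matches / len(shared)
```

With `calls = {("chr1", 100): "0|1", ("chr1", 200): "1|1"}` and `truth = {("chr1", 100): "1/0", ("chr1", 200): "0/1"}`, `concordance(calls, truth)` gives `0.5`.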
In switching to the 2.x series of GATK, I noticed that PBT now drops multi-allelic sites entirely from the output. Shouldn't the correct behavior be to write them out unmodified? Or is there a specific reason multi-allelic sites are not being written out?
Specifically, here is the current code:
if (vc == null || !vc.isBiallelic())
    return metricsCounters;
But I think it should be something like this...
if (vc == null)
    return metricsCounters;
if (!vc.isBiallelic()) {
    vcfWriter.add(vc);
    return metricsCounters;
}
Hi all,
Has anyone else gotten the following:
java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.phaseTrioGenotypes(PhaseByTransmission.java:242)
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:306)
    at org.broadinstitute.sting.gatk.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:35)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:225)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:122)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:149)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
My command line was: java -jar GenomeAnalysisTK.jar -T PhaseByTransmission -V w01.sorted.vcf -o w01.phased.vcf -f "mom+dad=child" -R hg19.fa
Cheers,
Paul
Hello,
While using PhaseByTransmission I always get this message:
INFO 23:07:16,945 PhaseByTransmission - Caution: Family F1 has 1 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped.
This is the PED file:
F1 26779 31599 31600 2 2
F2 31599 0 0 2 0
F3 31600 0 0 1 0
And the command line:
java -Xmx2g -jar /home/common/GenomeAnalysisTK-2.1-13/GenomeAnalysisTK.jar \
    -R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
    -T PhaseByTransmission \
    -V 38Ind_batch01_ped_snps.raw.SNP.filtered.vcf \
    -ped familys.ped \
    -pedValidationType SILENT \
    -o 38Ind_batch01_ped_snps.raw.SNP.filtered.phasedBT.vcf
PhaseByTransmission then only rewrites the VCF file without doing any phasing. Is there something wrong with the command line, or is the PED file malformed?
Is it possible to use PhaseByTransmission with families that are larger than a single trio? I have a family with four siblings. If I include all of the siblings in the PED I get:
PhaseByTransmission - Caution: Family BMD has 6 members; At the moment Phase By Transmission only supports trios and parent/child pairs. Family skipped.
ERROR MESSAGE: Bad input: No PED file passed or no trios found in PED file. Aborted.
And if I just include the one key trio with the proband, I get the following:
ERROR MESSAGE: Sample BMD006_R found in data sources but not in pedigree files with STRICT pedigree validation
There does not seem to be an accessible argument for relaxing the pedigree validation. Is there a way to use PhaseByTransmission with my larger family?
Hi all,
I have started a variant analysis of 4 family-related exome-seq samples in which a pathology seems to be related to a polymorphism. I am wondering which variant calling tool is best to use, and whether applying PhaseByTransmission refinement is the correct approach. In the PhaseByTransmission analysis, does the read group that I assigned to the BAM file play a role in defining the relationships, or do I have to use just the PED file?
Best
Giuliano
Hello, all
While using the walker PhaseByTransmission I always get this error:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-12-ga99c19d):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: File associated with name java.io.FileReader@5cf7c5b5 is malformed: Bad PED line 1: wrong number of fields
##### ERROR ------------------------------------------------------------------------------------------
My command is:
java -jar GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -T PhaseByTransmission -R GRCh37.fasta -V trios_457.chr22.vcf -ped trios_457.chr22.ped -pedValidationType SILENT -o o1.vcf
And my PED file looks like this:
fam1 s_4 0 0 1 1 C C C C G G
fam1 s_5 0 0 2 2 T T T T G G
fam1 s_7 s_4 s_5 2 2 C T C T G G
I counted the entries in my VCF, PED and MAP files, and the result is:
-bash-4.1$ head -1 trios_457.chr22.ped |wc -w
1892 #( 6 columns for info + 943*2 columns for alleles )
-bash-4.1$ wc -l trios_457.chr22.map
943
-bash-4.1$ grep -v "#" trios_457.chr22.vcf | wc -l
943
My question is: what's wrong with my PED line?
Hi,
When I run PhaseByTransmission with the parameter --MendelianViolationsFile, I find garbled values (like [[I@424ace42], [[I@3d2b710e], [I@6f0b6d81) in the fields "MOTHER_AD", "FATHER_AD" and "CHILD_AD". As for the other two fields, "AC" and "TP", can anyone explain what they mean?
Thanks.
I know that PhaseByTransmission can accept a single VCF file containing the three samples of a trio as its input. Can it also accept three separate VCF files for a trio as its input?
Thanks