Once you have identified variant sites in your data and assigned genotypes for each sample at those sites, it can be useful to further refine the genotypes, e.g. by phasing, by imputation, and by identifying Mendelian violations. The GATK includes tools for phasing and identifying Mendelian violations, but does not include imputation tools. However, we do provide conversion tools to interface with BEAGLE, a popular imputation tool developed at Washington University.
Once you have generated and filtered your callset according to our recommendations, you have several options for evaluating and refining the variant and genotype calls further, before moving on with your study. We do not provide comprehensive guidelines for this step because different studies will have different requirements, but we do provide tools and general advice.
The main options available through GATK are:
Some of the tools involved in those processes (such as SnpEff and BEAGLE) are third-party tools that are not part of GATK. We recommend those specific tools because we are most familiar with them, but there may be alternative software that would work just as well or even better with your particular data. Also, please understand that we cannot provide support for those tools; if you have any problems with them we suggest you seek out the corresponding documentation form the tool project websites, and contact their developers with any questions you may have.
For a complete, detailed argument reference, refer to the GATK document page here
The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.
The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.
The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 chr20 332341 rs6076509 A G 470.60 PASS AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57 GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99 chr20 332503 rs6133033 C G 726.23 PASS AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75 GT:DP:GL:GQ:PQ 1|0:60:-110.83,-18.08,-149.73:99:126.93
The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. Note that for cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.
For example, consider the following records from the VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 chr1 1 . A G 99 PASS . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 2 . A G 99 PASS . GT:GL:GQ:PQ 1|1:-100,0,-100:99:60 0|1:-100,0,-100:99:50 chr1 3 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:60 0|0:-100,0,-100:99:60 chr1 4 . A G 99 FAIL . GT:GL:GQ 0/1:-100,0,-100:99 0/1:-100,0,-100:99 chr1 5 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:70 1|0:-100,0,-100:99:60 chr1 6 . A G 99 PASS . GT:GL:GQ:PQ 0/1:-100,0,-100:99 1|1:-100,0,-100:99:70 chr1 7 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:80 0|1:-100,0,-100:99:70 chr1 8 . A G 99 PASS . GT:GL:GQ:PQ 0|1:-100,0,-100:99:90 0|1:-100,0,-100:99:80
The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:
And two haplotypes at positions 6-8:
And, SAMP2 has the two haplotypes at positions 1-8:
To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.
But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.
At the moment there are two of the standard annotations affected by pedigree:
Note that you will need at least 10 founders to compute the inbreeding coefficient.
In the specific case of trios, an additional GATK walker, PhaseByTransmission, should be used to obtain trio-aware genotypes as well as phase by descent.
The annotations mentioned above have been adapted for PED files starting with GATK v.1.6. If you already have VCF files generated by an older version of the GATK or have not passed a PED file while running the UnifiedGenotyper or VariantAnnotator, you should do the following:
-G StandardAnnotationto VariantAnnotator. Make sure you pass your PED file to the VariantAnnotator as well!
The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
Hello! I was wondering if the HaplotypeScore annotation was restored for HaplotypeCaller in GATK 2.6. Does it have to be called? (It's not included in my vcf file.) Moreover, all of the GT field designations have "/" instead of "|" which according to the following would mean that the results are still unphased:
"GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are: / : genotype unphased | : genotype phased" http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40
Also, is there a more detailed explanation than what's on the HaplotypeScore documentation page? How is the score determined in UnifiedGenotyper? Does the score have anything to do with phasing? Also, how is phasing achieved if only the 10bps surrounding the SNP are examined, regions which likely do not include other SNPs?
I used Beagle to phase my data but for some indels, I have some probleme :
Input vcf :
2 68599872 . ATG A 14.40 PASS AC=1;AC1=1;AF=0.028
Input for beagle created by ProduceBeagleInput:
2:68599872 TG - 1.0000 0.0000 0.0000 ......
Output vcf created by BeagleOutputToVCF:
2 68599872 . ATG . 14.40 BGL_RM_WAS_- AC1=1;AF1=0.02965.....
error message by CombineVariants:
MESSAGE: Badly formed variant context at location 68599872 in contig 2. Reference length must be at most one base shorter than location size
Can you help me?