Tagged with #genotype
1 documentation article | 0 announcements | 13 forum discussions


Comments (0)

Introduction

Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.

The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.

Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.

Command-line arguments

Usage of GenotypeAndValidate and its command line arguments are described here.

The VCF Annotations

The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.

The Outputs

GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:

ALT REF Predictive Value
called alt True Positive (TP) False Positive (FP) Positive PV
called ref False Negative (FN) True Negative (TN) Negative PV

The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.

The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.

The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).

Additional Details

  • You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)

  • The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).

Examples

Genotypes BAM file from new technology using the VCF as a truth dataset:

java \
    -jar /GenomeAnalysisTK.jar \
    -T  GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myNewTechReads.bam \
    -alleles handAnnotatedVCF.vcf \
    -BTI alleles \
    -o gav.vcf

An annotated VCF example (info field clipped for clarity)

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1
1   22359922    .   T   C   282  WG-CG-HiSeq      AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ  1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99    ./.
13  102391461   .   G   A   341  Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ  ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99   ./.
1   175516757   .   C   G   655  SnpCluster,WG    AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ  ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99   ./.

Using a BAM file as the truth dataset:

java \
    -jar /GenomeAnalysisTK.jar \
    -T  GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myTruthDataset.bam \
    -alleles callsToValidate.vcf \
    -BTI alleles \
    -bt \
    -o gav.vcf

Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:

PacBio PbGenotypeAndValidate results

No posts found with the requested search criteria.
Comments (10)

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....

Comments (4)

Dear Team, I was looking at a VCF file produced with UnifiedGenotyper (2.4.9). It is a multisample call and, for a limited number of calls, I have genotypes that are telling the exact opposite of AD field, as in this case

GT:AD:DP:GQ:PL  1/1:10,1:11:3:24,3,0

or

GT:AD:DP:GQ:PL  1/1:18,1:19:3:22,3,0

I have ten reads supporting the reference allele, 1 read supporting the alternate and the genotype is 1/1. This is happening in ~200 sites per sample in my dataset. I've checked the other way around and I found <100 sites in which the genotype is called 0/0 and the AD suggests 1/1 or (more frequently) 0/1. This seems to happen in sites in which the number of variant samples is low (no more than 3 samples in a set of ~50 samples) and it is puzzling me a lot. Can you give me a comment on why this is happening? Thanks

d

Comments (7)

I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.

Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.

#CHROM  POS       ID           REF  ALT  100403001-1           100403001-1A           100403001-1B        
                                                                                                          
1       5933530   rs905469     A    G    0/0:37,0:0,99,1192    0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
1       5933530   rs905469     A    G    0/0:37,0:189,15,1192  0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
                                                                                                          
1       10412636  rs4846215    A    T    0/0:119,0:0,358,4297  0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
1       10412636  rs4846215    A    T    0/0:119,0:110,9,0     0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
                                                                                                          
1       11729035  rs79974326   G    C    0/0:50,0:0,141,1709   0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
1       11729035  rs79974326   G    C    0/0:50,0:1930,0,3851  0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
                                                                                                          
1       16735764  rs182873855  G    A    0/0:54,0:0,138,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
1       16735764  rs182873855  G    A    0/0:54,0:174,0,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
                                                                                                          
1       17316577  rs77880760   G    T    0/0:42,0:0,123,1470   0/0:38,0:0,111,1317    0/0:53,0:0,153,1817 
1       17316577  rs77880760   G    T    0/0:42,0:233,17,1470  0/0:38,0:0,111,1317    0/0:225,25:0,153,181
                                                                                                          
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:0,87,1066  
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:1844,159,0 
                                                                                                          
1       31740706  rs3753373    A    G    0/0:123,0:0,349,4173  0/0:110,0:0,319,3793   0/0:111,0:0,328,3885
1       31740706  rs3753373    A    G    0/0:123,0:117,6,0     0/0:110,0:0,319,3793   0/0:111,0:0,328,3885

Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.

In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.

Thanks, Kevin

Comments (9)

Hi,

Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R 
chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0

The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:

java -jar $GATKROOT/GenomeAnalysisTK.jarT VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnpdbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf

Does anyone have any suggestions? Thanks in advance!

Comments (4)

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Comments (1)

Hi. I have converted the tsv file from Complete genomics(CGA tool) to vcf format. but, when I run programs, it says some error in my vcf format. I carefully examined the vcf file and found, in some SNVs the GT is 1|.(dot). Is it a valid vcf file or is there any problem in vcf file? The program showed the error in those lines in which GT is 1|.(dot) i.e half genotype information. The line looks like this:

1   55164   .   C   A   .   .   NS=1;AN=1;AC=1;CGA_XR=dbsnp.103|rs3091274;CGA_RPT=L2|L2|49.7;CGA_SDO=2  GT:PS:FT:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP  1/.:.:VQLOW:30,.:30,.:8,.:-30,0,0:-8,0,0:18:18,.:0
Comments (1)

While running VariantEval, I'm trying to stratify by a JexlExpression by setting using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.

My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one, or loop over the genotypes and either OR or AND the results (perhaps another input similar to -isr?), but perhaps I'm missing something subtle?

Would you like this behavior or are you happy with the current operation of jexlExpression?

Cheers!

Comments (2)

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

Please can you clarify?

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Comments (11)

Hello the team,

For some genotypes, it seems are wrong, I know it's model based, and base q, map q, etc are considered in the model. I also read this link: http://gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv#latest But my case are special, the format is (ref allele count)/(alternative allele count) genotype call: 22/24 0/0 109/125 0/0 85/109 0/0 26/32 0/0 40/161 0/0 195/6 1/1 239/5 1/1 83/6 1/1 46/28 1/1

In one case, the two variants are adjacent to each other. In some case, they are one base indels.

Thanks,

Jim

Comments (2)

I've tried to get genotype for all sites provided in interval file using haplotypeCaller. If using unifiedGenotyper, I can get the result by setting "output_mode EMIT_ALL_SITES". But haplotypeCaller doesn't report as expected by "output_mode EMIT_ALL_SITES". Even though I set "genotypeFullActiveRegion" or "fullHaplotype", haplotypeCaller doesn't seem to emit genotype at all sites. How to get desirable result using haplotypeCaller? Thanks!

Comments (1)

I have an inbred mouse strain that I am sequencing and there should be little to NO heterozygosity. Yet with the default settings of UGT -heterozygosity (which is 0.001) many homs are being called as hets. When 230/250 reads are alternate and 20/250 are reference, it calls a het, even though it should be homozygous alternate.

What do you recommendations for this setting for inbred animals?

thanks, GATK is great!

Vivek

Comments (8)

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:

GT:AD:DP:GQ:MQ0:PL 1/1:0,66:66:99:0:2337,187,0

This seems to me to be a serious bug. Is this anything that's been noted before?

I am running GATKLite version 2.1-3-ge1dbcc8

Gene

Comments (6)

Hi,

I am observing the following scenario at one particular SNP (C/G) using two different enrichment technologies: (I am using IGV syntax: ALLELE|number of reads w/ allele|%of total reads|+strand reads|- strand reads) - technology1:
C: 15 47% 15+,0- G: 17 53% 17+,0- - technology2:
C: 17 37% 13+,4- G: 29 63% 26+,3- As you can see both technologies have good coverage of the SNP and also good representation of each allele. SNP(C/G) does not get called in technology1.

My questions are: 1- Does the GATK algorithm have some sort of constraint on the proportion of reads coming from only one strand (as with technology1) in order to try to predict or discard duplicates? 2- I know that the base call of a particular base is bounded by the mapping quality of its read. If my --stand_call_conf is 30 and one of the bases at this SNP position has MQ<30 does this avoid this position getting called? Or is it more like the avg(MQ) has to be >30 (meaning more than one read at this position is taken into account)?

Thanks for any clarification, Gene