A GATKReport is simply a text document that contains a well-formatted, easy-to-read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.
Here's a simple example:
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7     qualavg.61PA8.7
0      7.451835696110506E-3  25.474613284804366
1      2.362777171937477E-3  29.844949954504095
2      9.087604507451836E-4  32.875909752547310
3      5.452562704471102E-4  34.498999090081895
4      9.087604507451836E-4  35.148316651501370
5      5.452562704471102E-4  36.072234352256190
6      5.452562704471102E-4  36.121724890829700
7      5.452562704471102E-4  36.191048034934500
8      5.452562704471102E-4  36.003457059679770

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key     column
1:1000  T
1:1001  A
1:1002  C
This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.
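For readers working outside R, the layout described above (a metadata header, a name and description header, a row of column names, then data rows) can be read with a few lines of Python. This is only a minimal sketch, not part of GATK or gsalib: the names `parse_gatk_report` and `GATKTable` are hypothetical, the metadata fields are stored without being interpreted, and the code assumes that values are separated by whitespace and never contain spaces themselves.

```python
from dataclasses import dataclass, field


@dataclass
class GATKTable:
    """One table from a report (hypothetical helper type, not a GATK API)."""
    name: str
    description: str
    meta: list                                    # raw metadata fields, uninterpreted
    columns: list = field(default_factory=list)
    rows: list = field(default_factory=list)      # one dict per data row


def parse_gatk_report(text):
    """Return a dict mapping table name to GATKTable.

    Assumes whitespace-separated values with no embedded spaces.
    All values are kept as strings.
    """
    prefix = "#:GATKTable:"
    tables = {}
    current = None        # table currently being filled
    pending_meta = None   # metadata header waiting for its name header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None            # a blank line ends the current table
            continue
        if line.startswith("#:GATKReport"):
            continue                  # report-level header, ignored here
        if line.startswith(prefix):
            body = line[len(prefix):]
            if pending_meta is None:
                # First table header: metadata such as "false:2:3:%s:%c:;"
                pending_meta = body.rstrip(":;").split(":")
            else:
                # Second table header: "Name:Description"
                name, _, description = body.partition(":")
                current = GATKTable(name, description, pending_meta)
                tables[name] = current
                pending_meta = None
            continue
        if current is None:
            continue                  # text outside any table
        fields = line.split()
        if not current.columns:
            current.columns = fields  # first row after the headers
        else:
            current.rows.append(dict(zip(current.columns, fields)))
    return tables


# The example report from above, with the first table truncated to two rows.
SAMPLE = """\
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7  qualavg.61PA8.7
0  7.451835696110506E-3  25.474613284804366
1  2.362777171937477E-3  29.844949954504095

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key  column
1:1000  T
1:1001  A
1:1002  C
"""

if __name__ == "__main__":
    for name, table in parse_gatk_report(SAMPLE).items():
        print(name, table.columns, len(table.rows))
```

Values are left as strings on purpose; convert them (for example with `float`) once you know what each column holds.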
We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are four simple steps for getting gsalib, installing it, and loading a report.
$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Get the gsalib library from CRAN
The gsalib library is available on the Comprehensive R Archive Network, so you can just do:
> install.packages("gsalib")
From within R (we use RStudio for convenience).
In some cases you need to explicitly tell R where to find the library; you can do this as follows:
$ cat .Rprofile
.libPaths("/path/to/Sting/R/")
> library(gsalib)
> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list
In the output grp file,
#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
0         2859049  2932784
1         856997   837791
2         288587   276253
3         95618    91703
what's the meaning of the three columns?
I just finished running a fairly large number of WGS samples through HaplotypeCaller and I've been using VariantEval to look at some summary stats on these samples. I've noticed that under '#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table' there's a section on structural variations and that apparently I'm getting about 3500 in one of my samples. Here's the actual section of the table in question:
#:GATKTable:20:3:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%s:%d:%.2f:%.1f:%d:%s:%d:%.1f:%d:%s:%d:;
#:GATKTable:VariantSummary:1000 Genomes Phase I summary of variants table
VariantSummary CompRod EvalRod JexlExpression Novelty nSamples nProcessedLoci nSNPs TiTvRatio SNPNoveltyRate nSNPsPerSample TiTvRatioPerSample SNPDPPerSample nIndels IndelNoveltyRate nIndelsPerSample IndelDPPerSample nSVs SVNoveltyRate nSVsPerSample
VariantSummary dbsnp vcf1 none all 1 3095693981 3446166 2.08 1.34 3446166 2.08 0.0 962028 15.33 962028 0.0 3282 73.58 3282
VariantSummary dbsnp vcf1 none known 1 3095693981 3399907 2.08 0.00 3399907 2.08 0.0 814506 0.00 814506 0.0 867 0.00 867
VariantSummary dbsnp vcf1 none novel 1 3095693981 46259 1.71 100.00 46259 1.71 0.0 147522 100.00 147522 0.0 2415 100.00 2415
I didn't think that HaplotypeCaller even looked for structural variations, so I tried to find them in the VCF, hoping they were encoded as described here, but I couldn't find anything. Could someone tell me why VariantEval reports structural variations when the VCF doesn't appear to contain any? Does VariantEval just interpret a sufficiently large indel as an SV? If so, I can understand why it would report some, since the sample contains indels longer than 1,000 bp.