All our variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism), then it adjusts the base quality scores in the data based on the model.
In addition, there is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process.
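To make the idea of an over- or under-estimated quality score concrete, here is a minimal sketch of the standard Phred relationship between a quality score Q and an error probability p, p = 10^(-Q/10). The helper names `phred_to_prob` and `empirical_quality` are made up for illustration, and the empirical estimate is deliberately simplified: it uses no covariates and no smoothing, so it is not how the recalibration model itself is computed.

```shell
# Phred quality Q -> error probability p = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

# Observed mismatches and total bases -> empirical Phred quality.
# Simplified on purpose: no smoothing and no covariates.
empirical_quality() {
    awk -v err="$1" -v n="$2" 'BEGIN { printf "%.1f\n", -10 * log(err / n) / log(10) }'
}

phred_to_prob 30                 # error probability claimed by a reported Q30
empirical_quality 1000 100000    # quality implied by 1000 mismatches in 100000 bases
```

If bases reported as Q30 actually mismatch the reference at a rate of 1 in 100 (outside known variant sites), their empirical quality is closer to Q20, and that gap is the kind of systematic error recalibration is meant to correct.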
I'm trying to run AnalyzeCovariates to produce recalibration plots, but I'm not getting a PDF, so I decided to upgrade my R installation and all the required packages (gsalib, ggplot2, etc.). Now I'm getting the following error:
ERROR MESSAGE: Bad input: The GATK report has an unknown/unsupported version in the header: %PDF-1.4
I'm using GATK version 2.8-1-g932cd3a.
Here's the command I'm running:
java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R /path/genome.fa \
    -L /path/genome.interval_list \
    -before recal1.table \
    -after recal2.table \
    -plots recal.pdf \
    -csv recal.csv
I'm using the latest version of R and all the packages. Here's my R session info:
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
C

attached base packages:
grid tools stats graphics grDevices utils datasets methods base

other attached packages:
gsalib_2.0 reshape_0.8.4 plyr_1.8 gplots_2.12.1 ggplot2_0.9.3.1 BiocInstaller_1.12.0

loaded via a namespace (and not attached):
KernSmooth_2.23-10 MASS_7.3-29 RColorBrewer_1.0-5 bitops_1.0-6
caTools_1.16 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4
gdata_2.13.2 gtable_0.1.2 gtools_3.2.1 labeling_0.2
munsell_0.4.2 proto_0.3-10 reshape2_1.2.2 scales_0.2.3
stringr_0.6.2
I've seen in many other posts suggestions to manually run the BQSR.R script on the data, but I don't have a CSV file yet, and there were no instructions on how to manually run BQSR.R, i.e., what arguments to specify to the Rscript command, and in what order.
Any help solving this problem would be greatly appreciated.
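One possible reading of the error above: the "header" GATK complains about, %PDF-1.4, is how a PDF file begins, so a file being read as a GATK report may actually be a PDF. A quick way to check is to look at the first bytes of each input. The sketch below assumes bash and a `head` that supports `-c`; the function name `report_kind` is made up for illustration, and it relies on GATK report files starting with `#:GATKReport`.

```shell
# Hypothetical helper: classify a file by its first 12 bytes only.
report_kind() {
    local magic
    magic=$(head -c 12 "$1")
    case "$magic" in
        '%PDF-'*)        echo "pdf" ;;
        '#:GATKReport'*) echo "gatkreport" ;;
        *)               echo "unknown" ;;
    esac
}

# Check the two tables passed via -before and -after in the command above.
for f in recal1.table recal2.table; do
    if [ -e "$f" ]; then
        echo "$f: $(report_kind "$f")"
    fi
done
```

If either table reports "pdf", it was probably overwritten by an earlier run that wrote the plots to the same path, and it would need to be regenerated with BaseRecalibrator.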
I am running GATK on a cluster via PBS scheduling, and found that "AnalyzeCovariates" does not use my customized Rscript path.
All nodes run CentOS, and R is already installed and can be found under "/usr/bin/R" with "which R". Unfortunately, the R version is not identical across nodes: some nodes have R 2.15 and some have R 3.0.
I installed the latest R version under my home folder, and added the following commands to .bash_profile and .bash_rc:
if [ `lsb_release -i|cut -c17-20` == 'Cent' ] ; then
If I log in to the cluster via qsub -I and type R in the console, my customized R is invoked, and "which R" shows the same:
alias R='/home/XXX/R-3.0.2/bin/R'
~/R-3.0.2/bin/R
All GATK required packages have been installed.
However, when I run AnalyzeCovariates, it reports that some packages are missing, and it turns out that AnalyzeCovariates is using the R under "/usr/bin/R". How can I make AnalyzeCovariates use the right R? Am I missing something in my bash configuration files?
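A shell alias is only expanded when you type the command in an interactive shell, so it does not affect programs that launch R themselves. Assuming GATK starts Rscript by name and resolves it through PATH (my understanding, not something confirmed in this thread), putting the custom build's bin directory first on PATH inside the job script may be enough. The function name `use_r_from` is made up for illustration, and the directory is taken from the "which R" output above.

```shell
# Hypothetical helper: put a directory at the front of PATH so that
# child processes find its executables first.
use_r_from() {
    export PATH="$1:$PATH"
}

# Directory of the custom R build (from the "which R" output above).
use_r_from "$HOME/R-3.0.2/bin"

# Show which Rscript a child process such as GATK would now pick up.
command -v Rscript || echo "Rscript not found on PATH"
```

On a PBS cluster these lines would go in the submitted job script itself, since a batch job does not necessarily read the same startup files as an interactive login.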
I am using GATK 2.7.2 and following the GATK Best Practices. I have completed all the steps described there and now want to generate the before/after plots, using the following command:
-T AnalyzeCovariates -R ReferenceFiles\sequence.fasta -l DEBUG -before ReferenceFiles\recal_data.table -after ReferenceFiles\post_recal_data.table -plots ReferenceFiles\recalibration_plots.pdf
Running this command produces an error. Please refer to the attachment "GATK_AnalyzeCovariant_Error.txt" for the error.
After consulting the forums at http://www.broadinstitute.org, I have done the following:
- I have installed Rscript and set R_HOME in my environment variables and also in the path.
- I have copied BQSR.R into the GATK tools folder.
- I have installed the gsalib package in R.
- I have installed the ggplot2 package in R.
- Since I thought it could be a network proxy issue, I have also registered on the http://www.broadinstitute.org forum and requested the .key file used to disable the "phone-home" feature, which sends information about each GATK run via the Broad file system (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). The request will be reviewed before I receive my key.
Please help me understand what exactly the issue could be.
I performed Phase 1 with GATK 2.5-2. Has 2.6-5 changed enough to warrant redoing it with GATK 2.6-5? In particular, I would like to use the new plotting features of AnalyzeCovariates. Do I need to redo Phase 1 in order to use the latest version?
If I can keep the GATK 2.5-2 results for Phase 1, can I move on with GATK 2.6-5?
In GATK 2.6, there have been some changes to BaseRecalibrator. Based on the AnalyzeCovariates page, it must now be run twice. To generate the first pass recalibration table file, it's the same command as before. To generate the second pass recalibration table file, you need to add the -BQSR argument. However, on the BaseRecalibrator page, there is no -BQSR documentation.
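For reference, the two passes described above would look roughly like this. It is a dry-run sketch that only prints the commands; the reference, BAM, and known-sites file names are placeholders, and the function name `recal_cmd` is made up for illustration. As far as I know, -BQSR is a GATK engine-level argument rather than a BaseRecalibrator-specific one, which may be why it is not listed on the BaseRecalibrator page.

```shell
# Placeholder inputs; substitute your own reference, BAM, and known-sites VCF.
REF=reference.fa
BAM=input.bam
KNOWN=known_sites.vcf

# Print (rather than run) the BaseRecalibrator command for one pass.
# $1 is the output table; an optional $2 is the first-pass table,
# which adds -BQSR for the second pass.
recal_cmd() {
    local out=$1 prior=${2:-}
    local cmd="java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF -I $BAM -knownSites $KNOWN"
    if [ -n "$prior" ]; then
        cmd="$cmd -BQSR $prior"
    fi
    echo "$cmd -o $out"
}

recal_cmd recal1.table                 # first pass
recal_cmd recal2.table recal1.table    # second pass, applied on top of the first table
```

The two resulting tables are what AnalyzeCovariates takes as -before and -after.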
In Step 3, the example code still uses the deprecated walker, which when run generates this:
"ERROR MESSAGE: Walker AnalyzeCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)"