Tagged with #analyzecovariates
1 documentation article | 0 announcements | 12 forum discussions

Created 2014-06-12 22:52:09 | Updated 2015-04-26 00:23:41 | Tags: bqsr dependencies rscript analyzecovariates

Comments (19)

When you run AnalyzeCovariates to analyze your BQSR outputs, you may encounter an error starting with this line:

org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.

The most common cause of this error is simple, and so is the solution. The script depends on some external R libraries, so it fails if they are not installed. To find out which libraries are necessary and how to install them, you can refer to this tutorial.
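As a quick first check, something like the sketch below can report which R packages fail to load. The function name is made up for illustration, and the package names in the example call are just the ones mentioned in posts on this page, not an authoritative list (the tutorial has that).

```shell
# Report which of the given R packages fail to load.
# Prints one line per missing package; returns non-zero if any is missing.
check_r_packages() {
  missing=0
  for pkg in "$@"; do
    # library() stops with an error when a package is not installed,
    # which makes Rscript exit with a non-zero status.
    if ! Rscript -e "suppressPackageStartupMessages(library($pkg))" >/dev/null 2>&1; then
      echo "missing: $pkg"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (package list is an assumption, see the tutorial):
# check_r_packages ggplot2 gplots reshape gsalib
```

Any package reported as missing can then be installed from within R before re-running AnalyzeCovariates.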

One other common issue is that the version of ggplot2 you have installed is very recent and is not compatible with the BQSR script. If so, download this Rscript file and use it to generate the plots manually according to the instructions below.

If you have already checked that you have all the necessary libraries installed, you'll need to run the script manually in order to find out what is wrong. To new users, this can seem complicated, but it only takes these 3 simple steps to do it!

1. Re-run AnalyzeCovariates with these additional parameters:

  • -l DEBUG (that's a lowercase L, not an uppercase i, to be clear) and
  • -csv my-report.csv (where you can call the .csv file anything; this is so the intermediate csv file will be saved).
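Put together, the re-run might look like the sketch below. The input file names are hypothetical placeholders; the flags are the ones used in the posts on this page plus the two additions described above.

```shell
# Hypothetical input names; substitute your own files.
REF=reference.fasta
BEFORE=recal1.table
AFTER=recal2.table

# Build the command as a string first so it can be inspected before running.
debug_cmd="java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -before $BEFORE -after $AFTER -plots my-report.pdf -csv my-report.csv -l DEBUG"
echo "$debug_cmd"

# To actually run it and keep the log for the next step
# (works as written only if the paths contain no spaces):
# $debug_cmd > analyze_covariates.debug.log 2>&1
```

Saving the output to a file makes it easier to search for the R command line afterwards.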

2. Identify the lines in the log output that say what parameters the R script is given.

The snippet below shows you the components of the R script command line that AnalyzeCovariates uses.

INFO  18:04:55,355 AnalyzeCovariates - Generating plots file 'RTest.pdf' 
DEBUG 18:04:55,672 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/utils/recalibration/BQSR.R /Users/schandra/BQSR_Testing/RTest.csv /Users/schandra/BQSR_Testing/RTest.recal /Users/schandra/BQSR_Testing/RTest.pdf 
DEBUG 18:04:55,687 RScriptExecutor - Executing: 
DEBUG 18:04:55,688 RScriptExecutor -   Rscript 
DEBUG 18:04:55,688 RScriptExecutor -   -e 
DEBUG 18:04:55,688 RScriptExecutor -   tempLibDir = '/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/Rlib.2085451458391709180';source('/var/folders/j9/5qgr3mvj0590pd2yb9hwc15454pxz0/T/BQSR.761775214345441497.R'); 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.csv 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.recal 
DEBUG 18:04:55,689 RScriptExecutor -   /Users/schandra/BQSR_Testing/RTest.pdf 

So, your full command line will be:

Rscript BQSR.R RTest.csv RTest.recal RTest.pdf
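If the DEBUG output was saved to a file, the arguments can be pulled out of it rather than copied by hand. This is only a sketch: the function name and the log file name are made up, and it assumes the log contains an "R command line:" entry in the same form as the snippet above.

```shell
# Print the arguments GATK passed to BQSR.R, taken from a saved DEBUG log.
# Assumes a log line of the form:
#   ... RecalUtils - R command line: Rscript (resource)<path>/BQSR.R <csv> <recal> <pdf>
r_script_args() {
  # First sed keeps only what follows "...BQSR.R "; second strips trailing spaces.
  sed -n 's/^.*R command line: Rscript [^ ]*BQSR\.R *//p' "$1" | sed 's/ *$//'
}

# Example (hypothetical log file name):
# r_script_args analyze_covariates.debug.log
```

The printed paths are the CSV file, the recalibration table and the output PDF, in that order.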

Please note:

3. Run the script manually with the above arguments.

For new users, the easiest way to do this is to do it from within an IDE program like RStudio. Or, you can start up R at the command line and run it that way, whatever you are comfortable with.
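For the command-line route, a small wrapper such as the following sketch can catch a mistyped path before R even starts. The function name is hypothetical; the argument order follows the command line shown in step 2.

```shell
# Run BQSR.R by hand after checking that every input file exists.
# Arguments: <path to BQSR.R> <csv> <recal table> <output pdf>
run_bqsr_script() {
  for f in "$1" "$2" "$3"; do
    if [ ! -f "$f" ]; then
      echo "input not found: $f" >&2
      return 2
    fi
  done
  # Rscript prints the R error message directly to the terminal,
  # which is the point of running the script by hand.
  Rscript "$1" "$2" "$3" "$4"
}

# Example, using the file names from step 2:
# run_bqsr_script BQSR.R RTest.csv RTest.recal RTest.pdf
```

Whatever R prints at the point of failure is usually the most useful thing to include in a forum post.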


Created 2016-04-07 15:05:36 | Updated | Tags: baserecalibrator analyzecovariates convergence base-recalibration

Comments (2)


I have just run the base recalibration following the GATK Best Practices. As I'm working on a non-model organism, I had to run a first round of HaplotypeCaller and use the resulting variants (after filtration) to do the base recalibration, as recommended by the GATK Best Practices.

Everything seems OK: the pipeline ran on my data without errors. However, when I checked for convergence after the base recalibration (I ran a second round of BaseRecalibrator and then generated plots using AnalyzeCovariates), the reported base qualities after recalibration were very low. Most of my bases had quality scores higher than 20 before recalibration, but afterwards most of them dropped below 10! You can see the plots generated by AnalyzeCovariates in the attached file. The reported Q score after recalibration for substitutions is very low.

How could this happen? Does it just mean that I haven't reached convergence yet and need to conduct further rounds of recalibration? Could this be due to the data? Perhaps the variants used were not filtered stringently enough, and this is skewing the results?

I would appreciate your thoughts on this issue.


Created 2016-03-01 13:02:06 | Updated 2016-03-01 13:08:47 | Tags: baserecalibrator commandlinegatk queue qscript analyzecovariates

Comments (8)

Dear GATK team,

I'd like to ask a question about the possibility of running HaplotypeCaller and AnalyzeCovariates in parallel. I developed a QScript that runs indel realignment, BQSR and variant calling (obtaining a gVCF file as a result), and then BaseRecalibrator for the second time followed by AnalyzeCovariates. Looking at the QScript job report PDF file, I noticed that the second run of BaseRecalibrator was performed after HaplotypeCaller, although as far as I understand, HaplotypeCaller and the second run of BaseRecalibrator are independent with respect to data and could potentially run in parallel.

Can I run them in parallel somehow in order to save time?

Created 2016-02-23 18:01:29 | Updated | Tags: commandlinegatk queue qscript analyzecovariates

Comments (2)

Dear GATK team,

I use Queue to build a pipeline with GATK tools: RealignerTargetCreator, IndelRealigner, BaseRecalibrator, AnalyzeCovariates, and HaplotypeCaller. So, I developed the corresponding QScript. When I tested it the very first time, all functions finished with no errors. But then, when I invoked it with -startFromScratch, it failed to execute AnalyzeCovariates, saying:

ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'
ERROR 19:08:14,045 FunctionEdge - Contents of bqsr-report.pdf.out: [...]

In the bqsr-report.pdf.out file I can see no errors:

INFO 17:25:12,764 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,767 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:40
INFO 17:25:12,767 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:25:12,767 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:25:12,771 HelpFormatter - Program Args: -T AnalyzeCovariates -L intervals_to_process.interval_list -R Homo_sapiens_assembly38.fasta -before recal-table1.txt -after recal-table2.txt -plots bqsr-report.pdf -csv bqsr-report.csv
INFO 17:25:12,779 HelpFormatter - Executing as [...]
INFO 17:25:12,779 HelpFormatter - Date/Time: 2016/02/23 17:25:12
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,838 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:25:13,079 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:25:13,196 IntervalUtils - Processing 3088286401 bp from intervals
INFO 17:25:13,277 GenomeAnalysisEngine - Preparing for traversal
INFO 17:25:13,287 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:25:13,287 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:25:13,288 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 17:25:13,289 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 17:25:13,764 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,921 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,931 AnalyzeCovariates - Generating csv file 'bqsr-report.csv'
INFO 17:25:14,164 AnalyzeCovariates - Generating plots file 'bqsr-report.pdf'
INFO 17:25:20,279 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@3ce295f9
INFO 17:25:20,282 ProgressMeter - done 0.0 6.0 s 11.6 w 100.0% 6.0 s 0.0 s
INFO 17:25:20,283 ProgressMeter - Total runtime 7.00 secs, 0.12 min, 0.00 hours
INFO 17:25:21,513 GATKRunReport - Uploaded run statistics report to AWS S3

I can see that AnalyzeCovariates and HaplotypeCaller start nearly at the same time:

INFO 17:25:09,264 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'
INFO 17:25:09,264 FunctionEdge - Output written to bqsr-report.pdf.out
INFO 17:25:21,548 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' 'recaled.bam' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-variant_index_type' 'LINEAR' '-variant_index_parameter' '128000' '-o' 'sample.gvcf' '-D' 'dbsnp_144.hg38.vcf' '-ERC' 'GVCF' '-pcrModel' 'CONSERVATIVE'
INFO 17:25:21,548 FunctionEdge - Output written to sample.gvcf.out

and after HaplotypeCaller finishes successfully (I checked that it produced a gVCF file), this message about AnalyzeCovariates error is printed:

INFO 19:08:14,029 QGraph - 0 Pend, 2 Run, 0 Fail, 10 Done
ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'

Of course, AnalyzeCovariates didn't produce CSV and PDF reports this way.

When I invoked the QScript again (this time not from scratch) to reproduce the situation, it executed with no errors, and AnalyzeCovariates generated both CSV and PDF reports.

After that I executed AnalyzeCovariates manually with -l DEBUG and it produced only CSV report, no PDF. Here is the central part of the debug output of AnalyzeCovariates:

DEBUG 20:38:28,593 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/engine/recalibration/BQSR.R bqsr-report.csv recal-table1.txt bqsr-report1.pdf
DEBUG 20:38:28,607 RScriptExecutor - Executing:
DEBUG 20:38:28,607 RScriptExecutor - Rscript
DEBUG 20:38:28,607 RScriptExecutor - -e
DEBUG 20:38:28,608 RScriptExecutor - tempLibDir = '/tmp/Rlib.4876869209103817519';source('/tmp/BQSR.3779158431229884689.R');
DEBUG 20:38:28,608 RScriptExecutor - bqsr-report.csv
DEBUG 20:38:28,608 RScriptExecutor - recal-table1.txt
DEBUG 20:38:28,608 RScriptExecutor - bqsr-report1.pdf

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:


Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
DEBUG 20:38:34,790 RScriptExecutor - Result: 0
INFO 20:38:34,792 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@44723d95
INFO 20:38:34,795 ProgressMeter - done 0.0 7.0 s 11.6 w 100.0% 7.0 s 0.0 s
INFO 20:38:34,795 ProgressMeter - Total runtime 7.03 secs, 0.12 min, 0.00 hours

So, every time I start my QScript from scratch, AnalyzeCovariates fails, but it finishes successfully when I invoke my script the next time without -startFromScratch. When I execute AnalyzeCovariates manually with -l DEBUG, it produces the CSV report only.

Should I use the BQSR.R script directly?

I will be very grateful for any tips and help.

Created 2015-08-27 08:11:30 | Updated | Tags: analyzecovariates r warnings

Comments (10)


When running the BQSR script, I get the following warnings

Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion

I have managed to track down exactly where they come from:

for(cov in levels(data$CovariateName)) {
  d = data[data$CovariateName==cov,]
  if( cov == "Context" ) {
    d$CovariateValue = as.character(d$CovariateValue)
    d$CovariateValue = substring(d$CovariateValue,nchar(d$CovariateValue)-2,nchar(d$CovariateValue))
  } else {
    d$CovariateValue = as.numeric(levels(d$CovariateValue))[as.integer(d$CovariateValue)]

Here the problem is that levels(d$CovariateValue) contains both integers and strings (short DNA sequences), and the latter cause as.numeric to introduce NAs.

Is this something to be worried about? I am using GATK 3.4-46, but the error also occurs in 3.3-0.

Thanks, Michael Knudsen

Created 2015-08-06 09:28:46 | Updated | Tags: analyzecovariates

Comments (6)


I have used AnalyzeCovariates to plot the before and after recalibration results. I noticed that the quality scores in the insertion and deletion panels are quite high. Is this normal? For the covariate plots (cycle and context), why are there positive and negative values for quality score accuracy (y-axis)? Does each point represent a base? Is there any documentation with detailed explanations of how to interpret these plots?


Created 2014-05-17 00:05:22 | Updated 2014-05-17 00:12:43 | Tags: analyzecovariates

Comments (4)


I am trying to generate base recalibration plots using AnalyzeCovariates.

My command is as follows:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates -R GRCh37-lite.fa \
-before test_data/realigned/SA495-Tumor.sorted.realigned.grp \
-after test_data/realigned/SA495-Tumor.sorted.post_recal.grp2 \
-plots recal_plots.pdf

and this gives me an error

INFO  17:01:06,050 HelpFormatter - Date/Time: 2014/05/16 17:01:06
INFO  17:01:06,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:01:06,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:01:06,962 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:01:07,193 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:01:07,317 GenomeAnalysisEngine - Preparing for traversal
INFO  17:01:07,339 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:01:07,340 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:01:08,293 ContextCovariate -       Context sizes: base substitution model 2, indel substitution model 3
INFO  17:01:08,537 ContextCovariate -       Context sizes: base substitution model 2, indel substitution model 3
INFO  17:01:08,592 AnalyzeCovariates - Generating csv file '/tmp/AnalyzeCovariates3565832248324656361.csv'
INFO  17:01:09,077 AnalyzeCovariates - Generating plots file 'recal_plots.pdf'
INFO  17:01:18,598 GATKRunReport - Uploaded run statistics report to AWS S3
 ERROR ------------------------------------------------------------------------------------------
 ERROR stack trace
org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
    at org.broadinstitute.sting.utils.R.RScriptExecutor.exec(RScriptExecutor.java:174)
    at org.broadinstitute.sting.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:548)
    at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:380)
    at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.initialize(AnalyzeCovariates.java:394)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
 ERROR ------------------------------------------------------------------------------------------
 ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
 ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
 ERROR If not, please post the error message, with stack trace, to the GATK forum.
 ERROR Visit our website and forum for extensive documentation and answers to
 ERROR commonly asked questions http://www.broadinstitute.org/gatk
 ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
 ERROR ------------------------------------------------------------------------------------------

Any ideas? Thanks.

Created 2014-02-06 16:17:55 | Updated | Tags: pdf analyzecovariates

Comments (11)

I'm trying to run AnalyzeCovariates to produce calibration plots, but not getting a PDF, so I decided to upgrade my R installation and all the packages required (gsalib, ggplot2, etc). Now I'm getting the following error:

ERROR MESSAGE: Bad input: The GATK report has an unknown/unsupported version in the header: %PDF-1.4

I'm using GATK version 2.8-1-g932cd3a.

Here's the command I'm running:

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R /path/genome.fa \
    -L /path/genome.interval_list \
    -before recal1.table \
    -after recal2.table \
    -plots recal.pdf \
    -csv recal.csv

I'm using the latest version of R and all the packages. Here's my R sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

[1] C

attached base packages:
[1] grid      tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] gsalib_2.0           reshape_0.8.4        plyr_1.8            
[4] gplots_2.12.1        ggplot2_0.9.3.1      BiocInstaller_1.12.0

loaded via a namespace (and not attached):
 [1] KernSmooth_2.23-10 MASS_7.3-29        RColorBrewer_1.0-5 bitops_1.0-6      
 [5] caTools_1.16       colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4      
 [9] gdata_2.13.2       gtable_0.1.2       gtools_3.2.1       labeling_0.2      
[13] munsell_0.4.2      proto_0.3-10       reshape2_1.2.2     scales_0.2.3      
[17] stringr_0.6.2     

I've seen in many other posts suggestions to manually run the BQSR.R script on the data, but I don't have a CSV file yet, and there were no instructions on how to manually run BQSR.R, i.e., what arguments to specify to the Rscript command, and in what order.

Any help solving this problem would be greatly appreciated.

Created 2014-01-05 12:06:13 | Updated 2014-01-05 12:07:05 | Tags: analyzecovariates r

Comments (2)

I am running GATK on a cluster via PBS scheduling, and found that AnalyzeCovariates does not use my customized Rscript path.

More info:

All nodes have CentOS installed. R is already installed and can be found under "/usr/bin/R" according to "which R". Unfortunately, the R version is not identical among nodes, i.e., some nodes have R 2.15 and some have R 3.0 installed.

I installed the latest R version under my home folder, and added the following commands to .bash_profile and .bashrc:

if [ "$(lsb_release -i | cut -c17-20)" == 'Cent' ] ; then
    alias R='/home/XXX/R-3.0.2/bin/R'
    alias Rscript='/home/XXX/R-3.0.2/bin/Rscript'
fi

If I login to the cluster via qsub -I, and type R in the console, customized R will be invoked, and this is also shown in "which R" :

alias R='/home/XXX/R-3.0.2/bin/R'
        ~/R-3.0.2/bin/R

All GATK required packages have been installed.

However, when I run AnalyzeCovariates, it reports that some packages are missing, and it turns out that AnalyzeCovariates is using the R under "/usr/bin/R". So how can I make AnalyzeCovariates use the right R? Am I missing something in the bash configuration files?
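One thing worth trying, sketched below: shell aliases only exist inside the shell that defines them, so a program started from that shell does not see them. Assuming GATK finds Rscript by searching the PATH (an assumption, not confirmed on this page), putting the private R directory first on the PATH should have the intended effect. The function name is made up for illustration.

```shell
# Put a private R installation first on PATH so that child processes
# (such as the JVM that runs GATK) resolve "Rscript" to it.
# Unlike an alias, an exported PATH is inherited by child processes.
use_r_from() {
  PATH="$1:$PATH"
  export PATH
}

# Example, with the directory from the post above:
# use_r_from /home/XXX/R-3.0.2/bin
# env Rscript --version   # env ignores aliases, so this shows what a child process would find
```

In a PBS job, the call would need to go into the job script itself (or a file it sources), since the job may not read the same startup files as an interactive login.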


Created 2013-12-03 07:06:47 | Updated | Tags: analyzecovariates

Comments (1)

I am using GATK 2.7.2. I am working on the Best practices of GATK. I have followed all the steps as mentioned for Best practices. I want to Generate before/after plots. This is done by the following command

-T AnalyzeCovariates -R ReferenceFiles\sequence.fasta -l DEBUG -before ReferenceFiles\recal_data.table -after ReferenceFiles\post_recal_data.table -plots ReferenceFiles\recalibration_plots.pdf

On running this command I get an error. Please refer to the attachment for the error: “GATK_AnalyzeCovariant_Error.txt”

After referring to the forums on http://www.broadinstitute.org:
- I have already installed Rscript and set R_HOME in my environment variables and also in the path.
- I have copied BQSR.R into the GATK tools folder.
- I have installed the gsalib package in R.
- I have installed the ggplot2 package in R.
- Since I thought it could be a network proxy issue, I have also registered on the http://www.broadinstitute.org forum and asked for the .key file which is used to disable the "phone-home" feature that sends us information about each GATK run via the Broad file system (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). It will be reviewed by them and then I can get my key.

Please help me understand what exactly the issue could be.


Created 2013-08-07 14:16:09 | Updated | Tags: baserecalibrator analyzecovariates

Comments (1)

I performed Phase 1 with GATK 2.5-2. Has 2.6-5 changed enough to warrant redoing it with GATK 2.6-5? In particular, I would like to use the new plotting features of AnalyzeCovariates. Do I need to redo this in order to use the latest version?

If I can use GATK 2.5-2 for Phase 1, can I move on with GATK 2.6-5?

Thank you.

Created 2013-07-11 23:08:11 | Updated 2013-07-11 23:08:34 | Tags: baserecalibrator documentation analyzecovariates

Comments (1)

In GATK 2.6, there have been some changes to BaseRecalibrator. Based on the AnalyzeCovariates page, it must now be run twice. To generate the first pass recalibration table file, it's the same command as before. To generate the second pass recalibration table file, you need to add the -BQSR argument. However, on the BaseRecalibrator page, there is no -BQSR documentation.
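For readers following along, the two-pass sequence described here might look like the sketch below. All file names are hypothetical placeholders, and the BaseRecalibrator flags other than -BQSR (-I, -knownSites, -o) are written from my understanding of GATK 2.x/3.x rather than taken from this page, so check them against the tool documentation. The commands are printed rather than run.

```shell
# Print (not run) the commands of the two-pass workflow.
# All file names are hypothetical placeholders.
GATK="java -jar GenomeAnalysisTK.jar"

# First pass: same command as before.
first_pass="$GATK -T BaseRecalibrator -R ref.fasta -I input.bam -knownSites known.vcf -o recal1.table"

# Second pass: identical, plus -BQSR pointing at the first-pass table.
second_pass="$GATK -T BaseRecalibrator -R ref.fasta -I input.bam -knownSites known.vcf -BQSR recal1.table -o recal2.table"

# Plots comparing the two tables.
plots="$GATK -T AnalyzeCovariates -R ref.fasta -before recal1.table -after recal2.table -plots recal.pdf"

printf '%s\n' "$first_pass" "$second_pass" "$plots"
```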

Created 2013-07-05 16:57:08 | Updated | Tags: tutorials baserecalibrator analyzecovariates

Comments (1)

In Step 3, the example code still uses the deprecated walker
-T AnalyzeCovariates
which, when used, generates this:
"ERROR MESSAGE: Walker AnalyzeCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)"