Tagged with #variant-recalibration
0 documentation articles | 0 announcements | 8 forum discussions

No articles to display.

No articles to display.

Created 2016-01-25 00:47:12 | Updated | Tags: vqsr resource variant-recalibration

Comments (1)

Hi, I am following through the best practice pipeline and I am confused of which files I should use as the resource to run VQSR.

I have downloaded 4 files and I just want to make sure which file is which database.

1000G_omni2.5.b37.vcf.gz = omni,vcf 1000G_phase1.snps.high_confidence.b37.vcf.gz = 1000G.vcf dbsnp_138.b37.vcf.gz = snp.vcf hapmap_3.3.b37.vcf.gz = hapmap.vcf

I am planning to unzipped them and run VQSR. Did I download the correct files? If not, could you tell me which files I need to use? Thank you.


Created 2015-02-25 02:12:39 | Updated | Tags: vqsr baserecalibrator haplotypecaller knownsites resources variant-recalibration

Comments (2)

Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resources file (for VQSR). I am working on rice for which the only known sites are the dbSNP VCF files which are built on a genomic version older than the reference genomic fasta file which I am using as basis. How does it affect the quality/accuracy of variants? How important is to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use the version which is built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc) can be performed without the known sites/resource file? If the answers to the above questions are too detailed, can you please point me to any document, if available, which might address this issue?

Thanks, NB

Created 2015-02-18 05:48:33 | Updated 2015-02-18 05:49:23 | Tags: variantrecalibrator annotations variant-recalibration

Comments (8)

Hi, I am working on variant calling for a genotype of rice and I just generated a GVCF file. Here is a snippet of the GVCF file.

Chr1 173923 . C T, 40.74 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:9:73,9,0,73,9,73:0,0,2,1 Chr1 247702 . G A, 16.63 . DP=2;MLEAC=2,0;MLEAF=1.00,0.00;MQ=29.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:48,6,0,48,6,48:0,0,0,0 Chr1 247703 . G . . END=247714 GT:DP:GQ:MIN_DP:PL 0/0:2:5:2:0,6,43 ..... Chr1 248043 . A C, 0 . DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=27.07;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,45,3,45,45:0,0,0,0

For the VQSR step, I only have the dbSNP file for rice (there are no hapmap or 1000G files) which has following annotation: CHROM POS ID REF ALT QUAL FILTER INFO Chr1 84 rs349504122 T C . . RSPOS=84;dbSNPBuildID=138;SAO=0;VC=snp;VP=050000000000000000020100 Chr1 2223 rs349642167 C T . . RSPOS=2223;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2228 rs350936653 A C . . RSPOS=2228;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2417 rs347865191 G A . . RSPOS=2417;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2546 rs349956414 A C . . RSPOS=2546;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2588 rs352913853 A C . . RSPOS=2588;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100 Chr1 2634 rs348515762 C A . . RSPOS=2634;GENEINFO=4326813:Os01g0100100;dbSNPBuildID=138;SAO=0;VC=snp;VP= 050000000000000000020100

Now, if I use -an QD for VQSR, it wont work because QD is not present in the input vcf file. So, given the two files above, what annotation can I use for VQSR step? Can I use the VQSR step at all?

And this is the setting in which I am using the dbSNP: -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 Are the settings of known/training/truth correct if dbSNP is the only resource?

Thanks in advance, you have been very helpful NB

Created 2015-02-17 03:22:15 | Updated | Tags: variantrecalibrator variant-recalibration

Comments (6)

Hi, I get the following error with VariantRecalibrator when I run it:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Os_nipponbare_v7.0_genome.fasta -input ESSR.vcf -mode SNP -recalFile ERS468243.raw.snps.recal -tranchesFile ERS468243.raw.snps.tranches -rscriptFile ERS468243.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_sorted.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

What am I doing wrong? Am I using the wrong annotation? Thanks in advance for any help. Best, NB

Created 2014-10-08 15:30:10 | Updated | Tags: variantrecalibrator dbsnp variant-recalibration rice

Comments (7)

Hi, I am running GATK on rice libraries and I get the following error with VariantRecalibrator. The command I use is:

java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R genome.fasta -input chr5.vcf -mode SNP -recalFile chr5.raw.snps.recal -tranchesFile chr5.raw.snps.tranches -rscriptFile chr5.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_chromosomes.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

This is how the program terminates:

INFO 21:42:35,099 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:42:35,102 TrainingSet - Found dbSNP track: Known = true Training = true Truth = true Prior = Q6.0

INFO 21:43:05,104 ProgressMeter - Chr11:11654383 9.83e+06 30.0 s 3.0 s 87.7% 34.0 s 4.0 s INFO 21:43:07,413 VariantDataManager - QD: mean = 27.02 standard deviation = 7.70 INFO 21:43:07,440 VariantDataManager - MQRankSum: mean = -0.62 standard deviation = 3.35 INFO 21:43:07,479 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 1.63 INFO 21:43:07,510 VariantDataManager - FS: mean = 2.11 standard deviation = 9.67 INFO 21:43:07,525 VariantDataManager - MQ: mean = 47.09 standard deviation = 11.34 INFO 21:43:07,660 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, MQRankSum, ReadPosRankSum] INFO 21:43:07,680 VariantDataManager - Training with 7845 variants after standard deviation thresholding.

INFO 21:43:09,833 VariantRecalibratorEngine - Convergence after 93 iterations! INFO 21:43:09,872 VariantRecalibratorEngine - Evaluating full set of 272372 variants... INFO 21:43:09,884 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 21:43:10,854 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I have read similar problems on GATK forums and based on those, it seems to me that the training set of VCFs is too small for my data. Is that so? If so, can you please tell me how can I fix it? This is for rice and I only have 1 set of known VCFs from dbSNP.

Can you also please confirm if my settings for 'known', 'training' and 'truth' are correct.

Thanks a lot in advance.

Created 2014-09-09 15:44:02 | Updated | Tags: vcf variant-recalibration

Comments (3)

Hello, What annotations are recommended to be used for the variant recalibrator for an exome sequencing project? I am using the Ion proton system so my VCF files are automatically generated after the sequencing run. Also, would I still have to mark duplicates and realign indels in my BAM files before viewing them on a genome viewer like IGV? Since I already have my VCF files, I don't know if this will be necessary. Thank you, these forums are very helpful! Ricky

Created 2014-06-20 21:57:33 | Updated | Tags: variant-recalibration

Comments (5)

Hi there,

I'm trying to write a simple scala script to run Variant Recalibrator. However, I'm having trouble setting the Variant Recalibrator "mode" properly in scala. The relevant portion of my script is:

def script() { 
    val VR = new VariantRecalibrator with CommonArguments
            VR.resource = qscript.resourceSNPFile
    VR.input :+= qscript.vcfFile
    VR.use_annotation :+= "DP"
    VR.use_annotation :+= "QD"
    VR.use_annotation :+= "FS"
    VR.use_annotation :+= "MQRankSum"
    VR.use_annotation :+= "ReadPosRankSum"
    VR.mode = "SNP"
    VR.nt = 4
    VR.TStranche :+= "100.0"
    VR.TStranche :+= "99.9"
    VR.TStranche :+= "99.0"
    VR.TStranche :+= "90.0"
    VR.recalFile = qscript.snpRecal
    VR.tranchesFile = qscript.snpTranches
    VR.rscriptFile = qscript.snpPlots

The problem is with the VR.mode line. If I say VR.mode = "SNP" I get a "type mismatch" error; if I say VR.mode = SNP (i.e. without the quotes) I get an error that says "not found: value SNP". How do I code this correctly so that scala can interpret that I'm running VariantRecalibrator in SNP mode. I've also tried VR.mode :+= "SNP" and VR.mode :+= SNP, but get error messages with this as well.

Thank you for any help!

Created 2014-03-24 23:49:59 | Updated 2014-03-24 23:52:11 | Tags: variantrecalibrator applyrecalibration tranches variant-recalibration

Comments (20)

Hi, I am planning to filter with ApplyRecal very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecal to choose my cutoff level. When I run the following command I get the default visualisation of 90, 99, 99.9 & 100 with cumulative TPs & FPs, and everything looks fine:

java -Xmx8g -jar $gatk \ -T VariantRecalibrator \ -R $ref \ -input $vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \ -resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \ -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 $dbsnp \ --minNumBadVariants 1000 \ -an QD \ -an MQRankSum \ -an ReadPosRankSum \ -an MQ \ -an FS \ -an DP \ -mode SNP \ -recalFile ${haplocall_date}_all_sites.snp.recal \ -tranchesFile ${haplocall_date}_all_sites.snp.tranches \ --rscript_file ${haplocall_date}_all_sites.snp.rscript \ 1>snp_recal_all_sites.out 2>snp_recal_all_sites.err &

But when I add in the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't seem to correlate as I expected):

-tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0 \

This means the cumulative FP/TP display is no longer appropriate and is making it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many and in what order I specify them:

The total number of variants that get through the filter also seems to change depending on what tranches are specified (eg the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something.

1-Why does the number of variants change when the same tranch is called in a different set of tranches? (And the Ti/Tv ratio seems to decrease regardless of whether the tranch is above or below 90)?

2- Why does the default tranch plot show ascending order of the tranches, but when I specify my own cutoffs it never does?

3- Alternatively, is there a way to remove the display of the cumulative TP/FPs to make the plot easier to read?

4- Or perhaps a simpler solution, can I bypass the plot altogether? Does one of the outputs from VariantRecal have the data (or a way of calculating it) about TP and FP predicted variants that is used to produce these plots so I can just compare the numbers?

Thanks very much, and let me know if I need to clarify anything