Tagged with #solid
0 documentation articles | 0 announcements | 8 forum discussions

No articles to display.

Created 2015-08-07 14:25:49 | Updated | Tags: depthofcoverage haplotypecaller dp solid igv

Comments (5)

Hello Everyone!

I'm using the whole GATK workflow to analyze targeted resequencing data from SOLiD platforms. I followed the Best Practices and used the appropriate SOLiD flags with BaseRecalibrator (--solid_recal_mode SET_Q_ZERO_BASE_N --solid_nocall_strategy PURGE_READ). However, when I look at the VCF files produced by HaplotypeCaller, something does not add up.

I checked some of the variants in some of my samples and found that the DP field does not report the same per-base coverage as the BAM (I used --bamOutput to produce a BAM from HaplotypeCaller) when I view it in IGV. As far as I understand, downsampling is applied at each position, but I'm seeing a lower DP value than the coverage stored in the BAM. I'm attaching an IGV screenshot of one of the variants where I encounter this problem. I deactivated all alignment filtering options in IGV, as well as downsampling. Here's the line reported in the VCF for this variant:

chr17 45249306 rs62077265 T C 11069.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.010;ClippingRankSum=-0.616;DB;DP=375;FS=90.048;MLEAC=1;MLEAF=0.500;MQ=59.56;MQRankSum=1.319;QD=29.52;ReadPosRankSum=2.229;SOR=0.016 GT:AD:DP:GQ:PL 0/1:150,224:374:99:11098,0,5080

As you can see from the screenshot, not only does the coverage differ, but many reads that match the reference are missing. Does anybody have an idea of what happened to the coverage in the VCF?

Thanks a lot for your time!
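To make the comparison concrete, here is a minimal Python sketch that pulls the depth-related fields out of the VCF record quoted above. The function name `parse_vcf_depths` is hypothetical and the parsing assumes a single-sample record with whitespace-separated columns, as pasted in this post. As far as I understand the GATK annotations, INFO DP counts reads before some filtering, while the sample-level DP and AD only count reads that passed the caller's filters and were informative, so a gap between these numbers and the raw pileup in IGV is not unusual in itself.

```python
def parse_vcf_depths(line):
    """Extract INFO DP, sample DP and AD from a single-sample VCF record."""
    fields = line.split()
    # INFO is column 8 (index 7); flags such as DB carry no '=' and are skipped.
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    # FORMAT keys (index 8) are paired with the sample values (index 9).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ad = [int(x) for x in sample["AD"].split(",")]
    return {
        "info_dp": int(info["DP"]),
        "sample_dp": int(sample["DP"]),
        "ad": ad,
        "ad_sum": sum(ad),
    }


# The record from the post, copied verbatim.
record = (
    "chr17 45249306 rs62077265 T C 11069.77 . "
    "AC=1;AF=0.500;AN=2;BaseQRankSum=-1.010;ClippingRankSum=-0.616;DB;DP=375;"
    "FS=90.048;MLEAC=1;MLEAF=0.500;MQ=59.56;MQRankSum=1.319;QD=29.52;"
    "ReadPosRankSum=2.229;SOR=0.016 "
    "GT:AD:DP:GQ:PL 0/1:150,224:374:99:11098,0,5080"
)

depths = parse_vcf_depths(record)
print(depths)
```

For this record the allele depths add up to the sample-level DP, and INFO DP is one read higher, so the VCF is at least internally consistent; the open question is why the IGV pileup shows more reads than either value.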


Created 2014-12-05 15:12:16 | Updated 2014-12-05 15:12:48 | Tags: solid picard markduplicates

Comments (2)


I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:





And I don't think Picard tools is able to pick these read names with its default regex.

I tried to change the default regex. This time it does not throw an error, but it takes too long and times out (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar I=$FILE O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam M=$BAMs/MarkDuplicates/$SAMPLE.metrics READ_NAME_REGEX="([0-9]+)_([0-9]+)_([0-9]+).*"

Any help is appreciated. Thanks!
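One way to check a candidate pattern before running MarkDuplicates is to test it on a few read names outside Picard. The sketch below assumes, as I understand the Picard documentation, that the regex must match the entire read name and expose three capture groups (tile, x, y). The read names shown are hypothetical placeholders, since the real ones did not make it into this post, and Python's `re` is used as a stand-in for Java's regex engine, which behaves the same for a simple pattern like this one.

```python
import re

# Three-group pattern of the kind READ_NAME_REGEX expects: tile, x, y.
PATTERN = re.compile(r"([0-9]+)_([0-9]+)_([0-9]+).*")


def parse_read_name(name, pattern=PATTERN):
    """Return (tile, x, y) as integers, or None if the whole name does not match."""
    match = pattern.fullmatch(name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Hypothetical read names, for illustration only.
for name in ["1234_56_789_F3", "read_without_coordinates"]:
    print(name, "->", parse_read_name(name))
```

If the real read names return None here, the pattern needs adjusting before Picard can use it for optical duplicate detection.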

Created 2014-08-29 13:55:01 | Updated | Tags: solid quality-scores

Comments (15)

Dear all, I want to call SNPs from SOLiD data (SE=35) using a recent GATK version (3.2.2), but I get the following errors from RealignerTargetCreator.

When using the argument -fixMisencodedQuals: ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

Without -fixMisencodedQuals: ERROR MESSAGE: SAM/BAM file SAMFileReader{XXXXX} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 62; please see the GATK --help documentation for options related to this error

When using the argument -allowPotentiallyMisencodedQuals, it runs without errors.

All my commands are as follows:

bfast match -f ref.fa -r 1.fastq -A 1 -n 16 >1.aligned.bmf

bfast localalign -f ref.fa -m 1.aligned.bmf -A 1 -n 16 >2.aligned.baf

bfast postprocess -f ref.fa -i 2.aligned.baf -A 1 -Y 2 -n 16 -b 0 >2.sam

java -Xmx60g -jar /bin/picard-tools-1.118/AddOrReplaceReadGroups.jar INPUT=2.sam OUTPUT=2.bam SORT_ORDER=coordinate RGID=OS RGLB=OS RGPL=solid RGPU=SRR035385 RGSM=OS

java -Xmx60g -jar /bin/picard-tools-1.118/MarkDuplicates.jar INPUT=2.bam OUTPUT=2rdup.bam METRICS_FILE=2rdup REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES=2000

java -jar /bin/GATK-3.2-2/GenomeAnalysisTK.jar -R ref.fa -T RealignerTargetCreator -I 2rdup.bam -o 2.realn.intervals -nt 8 -allowPotentiallyMisencodedQuals ###I got errors here

Can I get a correct VCF file if I use the argument -allowPotentiallyMisencodedQuals in the command above? Also, if any of my commands are wrong and may have caused this problem, please point them out.

I hope someone can help me with my questions, thank you!
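To see what the two error messages are reacting to, it can help to look at the raw quality characters directly. The following Python sketch decodes quality strings under a given offset and makes a rough guess at the encoding. The helper names and the example strings are hypothetical, and the rule used (any character below ASCII 64 rules out Phred+64) is a common heuristic that ignores old Solexa scores, not GATK's own check.

```python
def quality_range(qual_strings, offset=33):
    """Return (min, max) Phred score across all quality strings for an offset."""
    scores = [ord(ch) - offset for qual in qual_strings for ch in qual]
    return min(scores), max(scores)


def guess_offset(qual_strings):
    """Heuristic: characters below '@' (ASCII 64) cannot be Phred+64."""
    lowest = min(ord(ch) for qual in qual_strings for ch in qual)
    return 33 if lowest < 64 else 64


# Hypothetical quality strings, for illustration only.
quals = ["!+5?I", "IIII_"]
print("guessed offset:", guess_offset(quals))
print("score range at offset 33:", quality_range(quals, offset=33))
```

In this made-up example the data are Phred+33 (the '!' character decodes to 0), yet the '_' character decodes to 62, the same value the error message complains about. A file like that is correctly encoded but contains unusually high scores, which would be consistent with -fixMisencodedQuals failing while -allowPotentiallyMisencodedQuals runs.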

Created 2014-02-03 15:43:45 | Updated | Tags: haplotypecaller indels solid

Comments (3)

We ran a recent version of HaplotypeCaller on our SOLiD targeted resequencing data and got a ridiculous number of indels. We took a closer look at some of them and found absolutely no evidence for an indel at the called positions, so we wondered whether the internal realignment was doing something weird. Is this a known problem for SOLiD data? Our Illumina data works much better. It makes us wary of using GATK for SOLiD data... or is it just a filtering thing?

Created 2013-02-20 21:34:20 | Updated 2013-02-20 21:36:38 | Tags: solid lifescope

Comments (26)

Hello dear GATK Team,

Since version 2.3 I get the following error with some LifeScope 2.5 mapped SOLiD exome BAM files: "[...]appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 64; please see the GATK --help documentation for options related to this error".

After carefully searching the forum I found this discussion: gatkforums.broadinstitute.org/discussion/1592/baserecalibrator-error where ebanks offered the "--allow_potentially_misencoded_quality_scores" argument as a solution. This seemed to work at first: all walkers run with the argument no longer crash.

The problem is that UnifiedGenotyper and HaplotypeCaller seem to somehow ignore the reads (or something else...), because in these exomes both call only about 3000 variants, although judging by the runtime and log files they seem to process the whole file.

The exomes used to work and had normal calls prior to GATK 2.3.

Any ideas?

(The argument "--fix_misencoded_quality_scores" / "-fixMisencodedQuals" mentioned in this post: gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3 messes things up even more for the LifeScope BAMs.)



Created 2013-01-23 14:58:26 | Updated | Tags: baserecalibrator solid lifescope

Comments (6)

Does GATK BaseRecalibrator work with BAM files produced with the SOLiD LifeScope mapper?

You show in a base quality recalibration presentation that recalibration should also work on SOLiD data, but you don't mention whether it also works for BAM files produced with LifeScope. BWA mapping quality ranges from 0 to 37; LifeScope mapping quality ranges from 0 to 95.

I get an ArrayIndexOutOfBoundsException on the lifescope Bam files.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -92
	at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
	at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225)
	at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
	at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
	at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
	at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
	at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428)
	at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
	at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
	at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
	at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
	at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
	at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------

Created 2012-11-09 20:38:55 | Updated 2013-01-07 20:06:07 | Tags: best-practices solid

Comments (3)

Hi, I am working on SOLiD 5500xl data and used SHRiMP 2.2.3 for mapping. The library type is paired-end. I have read some discussions about SOLiD problems, but I still have doubts about some steps in the Best Practices:

  1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
  2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
  3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper with default values to call SNPs and indels. I end up with around 40 million variants. Is there any way I can get a more refined variant call set? Would you still advise applying the pre-processing steps above, or do you recommend applying some variant filtration to the called variants? If the former, could you explain how my concerns above are taken care of? I was looking for generally recommended filter values for INFO fields in the VCF such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some commonly used values for these fields that I can filter on to refine my variant data?
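For what it's worth, the mechanics of filtering on INFO fields can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names are hypothetical, the INFO string is made up, and the threshold numbers are placeholders to show the shape of the logic, not recommended cut-offs.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; bare flags map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def passes(info, thresholds):
    """Check numeric INFO fields against (kind, limit) rules; absent keys are skipped."""
    for key, (kind, limit) in thresholds.items():
        if key not in info:
            continue
        value = float(info[key])
        if kind == "min" and value < limit:
            return False
        if kind == "max" and value > limit:
            return False
    return True


# PLACEHOLDER thresholds, not recommendations: tune them on your own data.
PLACEHOLDER_THRESHOLDS = {"MQ": ("min", 40.0), "DP": ("min", 10.0)}

# Hypothetical INFO strings, for illustration only.
for info_string in ["DP=375;MQ=59.56;DB", "DP=4;MQ=22.10"]:
    print(info_string, "->", passes(parse_info(info_string), PLACEHOLDER_THRESHOLDS))
```

The same rules could be expressed with GATK's own filtering tools; the point of the sketch is only to show how per-field minimum and maximum checks combine.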

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Thanks in advance

Created 2012-10-29 11:56:43 | Updated 2012-10-29 12:02:41 | Tags: unifiedgenotyper single-end paired-end exome solid

Comments (2)


I am currently working on an exome sequencing project with older single-end SOLiD exomes and new paired-end exomes. In a first attempt (GATK 1.7 and Best Practices v3 back then) I tried calling and recalibrating all exomes together (120 at that time) without separating paired-end from single-end. As I had already validated many variants, I could check the quality of the calls and got very bad results, especially for indels (previously called, true positive variants were missing). My idea is that the UnifiedGenotyper has problems mixing paired-end with single-end exomes.

Is there any official recommendation for this problem? My solution right now is to group the exomes into batches (40-50 exomes) that were run on the same technology.

Also, a second problem/question: for some individuals, exomes were sequenced twice, and for some of these the first run was single-end and the second one was paired-end. The Best Practices mention that one should use all available reads for an individual for calling. Do you have any experience with how to handle these cases?

Any help is greatly appreciated!