Tagged with #solid
0 documentation articles | 0 announcements | 7 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (2)


I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:





And I don't think Picard tools is able to pick these read names with its default regex.

I tried to change the default regex. This time it does not throw an error, but it takes too long and times out (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar I=$FILE O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam M=$BAMs/MarkDuplicates/$SAMPLE.metrics READ_NAME_REGEX="([0-9]+)([0-9]+)([0-9]+).*"

Any help is appreciated. Thanks!

Comments (15)

Dear all, I want to call SNPs from Solid data (SE=35) using GATK recent version (3.2.2), but I got the following errors which is regarding to RealignerTargetCreator function: when using argument -fixMisencodedQual ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool without -fixMisencodedQual ERROR MESSAGE: SAM/BAM file SAMFileReader{XXXXX} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 62; please see the GATK --help documentation for options related to this error when using argument -allowPotentiallyMisencodedQuals, it can run well.

All my command like following, bfast match -f ref.fa -r 1.fastq -A 1 -n 16 >1.aligned.bmf

bfast localalign -f ref.fa -m 1.aligned.bmf -A 1 -n 16 >2.aligned.baf

bfast postprocess -f ref.fa -i 2.aligned.baf -A 1 -Y 2 -n 16 -b 0 >2.sam

java -Xmx60g -jar /bin/picard-tools-1.118/AddOrReplaceReadGroups.jar INPUT=2.sam OUTPUT=2.bam SORT_ORDER=coordinate RGID=OS RGLB=OS RGPL=solid RGPU=SRR035385 RGSM=OS

java -Xmx60g -jar /bin/picard-tools-1.118/MarkDuplicates.jar INPUT=2.bam OUTPUT=2rdup.bam METRICS_FILE=2rdup REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES=2000

java -jar /bin/GATK-3.2-2/GenomeAnalysisTK.jar -R ref.fa -T RealignerTargetCreator -I 2rdup.bam -o 2.realn.intervals -nt 8 -allowPotentiallyMisencodedQuals ###I got errors here

Can I get a correct VCF file when I using argument -allowPotentiallyMisencodedQuals in the following command! While, some wrong commands may lead to this problem, please point them out.

I hope someone can help me with my questions, thank you!

Comments (3)

We ran a recent version of Haplotyper Caller on our SOLiD targeted resequencing data and got a ridiculous number of indels. We took a closer look at some and there was absolutely no evidence for an indel at a called position, and wondered whether the internal realignment was doing something weird? Is this a known problem for SOLiD data? Our Illumina data works much better. It makes us now wary of using GATK for SOLiD data...is it just a filtering thing?

Comments (26)

Hello dear GATK Team,

since Version 2.3 I get the following error with some Lifescope 2.5 mapped SoLID exome Bam files: "[...]appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 64; please see the GATK --help documentation for options related to this error".

After carefully seaching the forum I found this discussion: gatkforums.broadinstitute.org/discussion/1592/baserecalibrator-error where ebanks offered the "--allow_potentially_misencoded_quality_scores" argument as solution. Actually this seemed to work at first, all walkers with the argument applied don't crash any more.

The Problem is that UnifiedGenotyper and HaplotypeCaller seem to somehow ignore the reads (or something else...) because in these exomes both call only about 3000 variants, allthough they seem to process the whole file judged by the runtime and logfiles.

The exomes used to work and had normal calls prior to GATK 2.3.

Any ideas?

(the argument "--fix_misencoded_quality_scores" / "-fixMisencodedQuals" as mentioned in this post: gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3 messes things up more for the Lifescope BAMs)



Comments (6)

Does GATK BaseRecalibrator work with Bam files produces with the SOLID Lifescope mapper?

You show in the a base quality recalibration presentation that recalibration also should work on SOLID data. But you don't mention if it also works for Bam files produced with lifescope. BWA mapping quality is from 0-37 , Lifescope mapping quality is from 0 - 95.

I get an ArrayIndexOutOfBoundsException on the lifescope Bam files.

`##### ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -92 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:225) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ----------------------------------------------`
Comments (3)

Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices

  1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
  2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
  3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Thanks in advance

Comments (2)


I am currently working on a Exome sequencing projekt with older single-end SOLiD exomes and new paired-end exomes. In a first attempt (GATK 1.7 and best practices v3 back then) i tried calling and recalibrating all exomes together (at that time 120) without selecting for paired/single-end. As I already had validatet many variants I could check the quality of the calls and got very bad results, especially for InDels (previously called, true positive variants missing). My idee is that the UnifiedGenotyper has Problems mixing paired-end with single-end exomes.

Is there any official recommendation for this problem? My solution right now is to group the exomes in batches (40-50 Exomes) which ran on the same technology.

Also a second Problem/Question: For some individuals exomes where sequenced twice, and for some of these the first run was single-end and the second one was paired. The best practices mentions one should se all available reads for a individual for calling. Do you have any experience on how to handle these cases?

Any help is greatly appreciated!