Hi,
I am currently working on a project in which we sequenced a library with insert sizes of approximately 70 bp using 2x100 paired-end sequencing. While this may seem unnecessary, it can improve base qualities a lot.
I have used SeqPrep (https://github.com/jstjohn/SeqPrep), which strips adaptors and merges reads that overlap; in our case the overlap covers the entire read most of the time. Merging also boosts the base qualities: if a base was sequenced twice, its quality improves quite a bit. This way, base qualities can reach 70 and above (if both reads had Q40 at a base, the probability of error is 0.0001 x 0.0001, i.e. a merged quality of 80). No funny business there. :)
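To make the arithmetic above concrete, here is a minimal Python sketch of the Phred conversion. It assumes the two reads are independent and agree on the base, as in the Q40 x Q40 example; the function names are made up for illustration, and this is not necessarily how SeqPrep computes merged qualities internally.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def merged_quality(q1, q2):
    """Hypothetical helper: quality of a base seen in both reads,
    assuming independent errors and agreeing base calls, so the
    error probabilities simply multiply (as in the example above)."""
    return error_to_phred(phred_to_error(q1) * phred_to_error(q2))

# Two Q40 observations of the same base: 0.0001 x 0.0001 = 1e-8
print(round(merged_quality(40, 40)))
# Two Q35 observations are already enough to reach Q70
print(round(merged_quality(35, 35)))
```

Under this assumption the Phred scores simply add, so any pair of overlapping bases with qualities summing to 70 or more yields the values GATK complains about.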
However, this does not seem to play nicely with GATK. The realignment crashes (see below), saying that the base quals must be erroneous. In my case, however, they are correct. Can I force GATK to work with these BQs? (--validation_strictness LENIENT didn't help, as you can see below. :)
Cheers,
Daniel Klevebring
INFO 13:04:07,408 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:07,411 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 13:04:07,411 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:04:07,411 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:04:07,416 HelpFormatter - Program Args: -T RealignerTargetCreator -I /scratch/3041404/P394_102.prmdup.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -o /scratch/3041404/P394_102.realn.intervals --intervals /bubo/proj/b2010040/private/GoldenPath/NG_design/1000G_REF_picard_custom_design_target_regions_HG19.bed.interval_list --validation_strictness LENIENT
INFO 13:04:07,416 HelpFormatter - Date/Time: 2013/03/13 13:04:07
INFO 13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:08,461 GenomeAnalysisEngine - Strictness is LENIENT
INFO 13:04:08,632 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 13:04:08,640 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:04:08,655 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 13:04:09,782 IntervalUtils - Processing 39772003 bp from intervals
INFO 13:04:10,001 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 13:04:10,262 GenomeAnalysisEngine - Done creating shard strategy
INFO 13:04:10,262 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:04:10,263 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 13:04:18,482 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.prmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 70; please see the GATK --help documentation for options related to this error
##### ERROR ------------------------------------------------------------------------------------------
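For context on why a score of 70 might look like a "wrong encoding", here is a small Python sketch of the two common quality offsets. The offsets themselves (Phred+33 and the older Phred+64) are standard; the interpretation that GATK's check is guarding against Phred+64 data read as Phred+33 is my assumption, and the helper names are made up for illustration.

```python
def encode_qual(q, offset=33):
    """Encode a Phred score as a single ASCII character."""
    return chr(q + offset)

def decode_qual(c, offset=33):
    """Decode a single ASCII quality character to a Phred score."""
    return ord(c) - offset

# A genuine Q70 in the Phred+33 encoding is the character 'g' ...
q70_char = encode_qual(70, offset=33)
print(q70_char)

# ... which under the older Phred+64 convention would mean Q39,
# a perfectly ordinary score.
print(decode_qual(q70_char, offset=64))

# Conversely, a Q40 written as Phred+64 and misread as Phred+33
# comes out as 71, which is in the same range as my merged qualities.
print(decode_qual(encode_qual(40, offset=64), offset=33))
```

So a file with real merged qualities of 70 and above is hard to tell apart, by score alone, from a Phred+64 file that was read with the wrong offset.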