Tagged with #bqrs
0 documentation articles | 0 announcements | 3 forum discussions


No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (3)

Hi, I am trying to do a base quality score recalibration of my BAM files. I am working with individual based RADseq data where the reads were cleaned and aligned against a reference genome using bowtie2. The reference genome comes from a relatively closely related taxon (~25 Mio years of divergence), however the species in this taxonomic family underwent a recent genome duplication, hence I am running bowtie2 with the -k option (which retains multiple alignments rather than picking randomly one if multiple matches of equal quality occur) and only retain the uniquely aligned reads for the downstream analyses. The downside to this is, that the MAPQ values become artificial.

After generating a recalibration data file, I run GATK PrintReads (version 3.1-1-g07a4bf8) as follow:

java -Xmx16G -jar GenomeAnalysisTK.jar -T PrintReads -R Ref_genome.fasta -I Individual.bam -BQSR Library_recaldata.grp -o Individual.rc.bam

Which yields the following error after running through 80% of the file: # ERROR MESSAGE: Exception when processing alignment for BAM index HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487 84b aligned read.

Here the read in question (the one in the middle):

HWI-ST1206:32:C3VGTACXX:5:2313:18281:8513   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA    DDDDEEEEEEFFFD@DHFGHHJHIIIIIGGJIJJIGGHHJJJJJJJJIJJJJJJIJJJJJIIJJJJJJJJJJJJJJJHHHHHFF    AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:2314:6453:46729   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA DDDDEEEEEEFFFEDFHHHHHJJIJJIJIIIIIGJJIFFJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIHJJJJJHHHHHFF   AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTACTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA</del>  DDDDDDDDDDDDDDDDDDDDDCEEEEEFFFFFHFFJJJJJJJJJIIHD8JJJJIHIJJJJJJIJJJJJJIJJIJJJJHHHHHFF    AS:i:-9 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:20G27T35   YT:Z:UU RG:Z:70130.satrFamilyX**
HWI-ST1206:32:C3VGTACXX:5:1104:10760:63491  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDCCCDDDDDDBCDCDCDC@DDEEECEEBFEFEHJJJJJJIHGJJIJJJJIHFJJJJIIJIJJIJJJIJJJJJJJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:16496:63907  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDDDDDDDDDDDDDDDCCCCDDEEECEEFFFFFHJJJIJJJJJJJJJJJJJIHJJJJJJJJJJIJJJJJJJJJIJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX

I subsequently verified my BAM files with Picard, which yields the following error # "bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned" for 84554 reads that are almost consecutive in the same genomic region (chrUn - which is composed of sequences for which no chromosome anchoring information exist). However, the reads flagged by Picard occur about 2 million base pairs before the region in question for GATK and PrintReads seemed to have no problem with them.

For the sake of completeness, here the beginning and end of the section that was flagged by Picard. It starts with the second of the following reads:

HWI-ST1206:32:C3VGTACXX:5:2315:1142:72568   256 chrUn   997157147   2   84M *   0   0   TGCAGGACCACAGTCTTTTCACCAGCTCCTCAAAATGGTTGAAGAAGTAGTATTTTGGGGGCTCGTATAAACTCGTATGCCGTC    FFHHHHHJJIJJJDFHHGIHIGIJJIJJJIIJJJIIJGIJIGIFHDHCGHCGGHIJIIIIEFDDBDDCDEDAAC?ABCDCCBB@    AS:i:-50    XS:i:-50    XN:i:0  XM:i:10 XO:i:0  XG:i:0  NM:i:10 MD:Z:2T41T8C7A7G0G1A2G3A2G1 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1102:4043:34769   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDDB@BDDEECHHIIIJHHIHJIHJIGJJJJJJIIJIJJJJIJJJIJJJJJJJIGCJJIJJJJJIJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1105:2829:22242   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDB@?;DDEE?HHIIGIIHHGJIIJJJJJJJJIIJJJJJJJJJJJJJJJIJIIHE?JIGJJJJJJJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX

And stops at the following read (note the second read here is ok):

HWI-ST1206:32:C3VGTACXX:5:2314:10564:44125  16  chrUn   1110931683  255 84M *   0   0   GCTAGCTGACTACTGTAGCCCCAGCCCTTATTAGCCTGCACTGTTATAAGGTGTTATGATTCACACACATTGAAATGGCCTGCA    DCDDDCECCCCCBDFFFHHHHGCIHHHFGF@FGHFIIGGGHJIJJJJJJJIJJJJJIJJJIHFIIJIJJJJJJJJJJHHHGHFF    AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:37A39A6    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1101:4103:100272  256 chrUn   1110931763  0   84M *   0   0   TGCAGGACAATGCCAGTGAATGGAGCCTAATTACCCTCTGTAATGAAATGTCACATTTTTGTTTCTGGGTCCTGAATAGATGTG    FFHHHHHJJJJJJJJJGIIJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJIJJJJIIJJJJIJJJJJJIHHHFHFFFFFFEEEEE    AS:i:-35    XS:i:-35    XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:13A34A5G4C1C11A10  YT:Z:UU RG:Z:70130.satrFamilyX

I personally cannot find something specific that would explain why PrintReads does not work. The pipeline that I use works perfectly when I use an assembly based on my RADseq data, yet not with the reference genome. Thus, I would like to ask, if you could perhaps help me. Thanks!

Comments (3)

Hi all, I'm trying to perform BQRS on a bam file I have. Unfortunately I get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (0) > (-1) STOP -- this should never happen, please check read: HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 2/2 101b aligned read. (CIGAR: 5D74M27S)

The culprit appears to be this read pair:

HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 147 chr2    230419662   24  5D74M27S    =   230419667   -74 GAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGTGAAAGGAAGGGAAAAGAAAAAGGAAAGGAAGGCAATCCCTGCCCAGGTTCTTAATTTTC   #####A>:3CC>;=>DC@;>66;.3@DAA>CA@77.)((./))@FBB.<<??/9*0*******00*?1***1*1)1)<)B3++2+2+22++2222+4++B=   PG:Z:MarkDuplicates RG:Z:GB_L008.1  NM:i:9  AS:i:43 XS:i:45
HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 99  chr2    230419667   53  83M18S  =   230419662   74  GAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGAAAGGAAGGAAAAAGAGAAAGGGAAGGAAGGAAATTCATGCTCAGTATCTAATTTTTA   ??;ADDDDHBF3DA=GBGB@D;EFC<3CDHBHICDEGEGIE=@@A@D@;C;;2?@7;=7>>;5>;;@B1<@CB<?>>::4++>@@>CC44>>@DC(:CA##   PG:Z:MarkDuplicates RG:Z:GB_L008.1  NM:i:0  AS:i:83 XS:i:50

Program arguments were:

-R /lustre1/genomes/hg19/fa/hg19.fa -knownSites /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf -I GB_dedup.realign.bam -T BaseRecalibrator --covariate QualityScoreCovariate --covariate CycleCovariate --covariate ContextCovariate --covariate ReadGroupCovariate --unsafe ALLOW_SEQ_DICT_INCOMPATIBILITY -nct 64 -o jobdir/GB.grp

I'm running the latest nightbuild of GATK. Any hint is very much appreciated

d

Comments (1)

Hi all, I would like to know which best practices are available for BQRS only and, before that, I need some detail on how BQRS works. In particular: suppose I have exome data for multiple samples and a set of intervals that covers all exome baits. One way to perform BQRS is to run the BaseRecalibrator walker on the whole file, using all BAM files available and have a single covariate table; I run PrintReads on each BAM file using the covariate table. Another way, is to run in the very same way feeding with the file containing intervals. This speeds things up because the walker doesn't have to check the whole genome space. Another way, faster, is to run a single BaseRecalibrator process for each interval. This results in a number of covariate tables equal to the number of intervals. I then run the same number of PrintReads and merge the results. If the latter would be enough, I know I can run really fast, but I'm afraid I may get some biased covariate table. I may apply the same procedure on whole genome, choosing the proper set of intervals