Tagged with #readgroup
2 documentation articles | 0 announcements | 7 forum discussions

Comments (4)


Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.


  • Installed Picard tools


  1. Sort the aligned reads by coordinate order
  2. Mark duplicates
  3. Add read group information
  4. Index the BAM file


You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

1. Sort the aligned reads by coordinate order


Run the following Picard command:

java -jar SortSam.jar \ 
    INPUT=unsorted_reads.bam \ 
    OUTPUT=sorted_reads.bam \ 

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

2. Mark duplicate reads


Run the following Picard command:

java -jar MarkDuplicates.jar \ 
    INPUT=sorted_reads.bam \ 
    OUTPUT=dedup_reads.bam \

Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

3. Add read group information


Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar  \ 
    INPUT=dedup_reads.bam \ 
    OUTPUT=addrg_reads.bam \ 
    RGID=group1 RGLB= lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 

Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

4. Index the BAM file


Run the following Picard command:

java -jar BuildBamIndex.jar \ 

Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.

Comments (0)

This utility replaces read groups in a BAM file

It is useful for fixing problems such as not having read groups in a bam file.

java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo

Note that this tool is now part of the Picard package: http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups

This tool can fix BAM files without read group information:

# throws an error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I testdata/exampleNORG.bam -T UnifiedGenotyper 

# fix the read groups
java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo CREATE_INDEX=True

# runs without error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I exampleNewRG.bam -T UnifiedGenotyper
No posts found with the requested search criteria.
Comments (4)

Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation.
I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this:

@RG ID:2_1  PL:ILLUMINA PU:Platform1_FCA    LB:Sample2  DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1
@RG ID:2_2  PL:ILLUMINA PU:Platform2_FCB    LB:Sample2  DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2
@RG ID:2_3  PL:ILLUMINA PU:Platform3_FCC    LB:Sample2  DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3

(IDs changed to something more generic)

I have run BaseRecalibrator, and the read group results are broken down by the PU field:

ReadGroup                    EventType  EmpiricalQuality  EstimatedQReported  Observations   Errors     
Platform1_FCA    M                   28.2433             28.1998  6283676400.00   9416437.03
Platform1_FCA    I                   41.2350             45.0000  6283676400.00    472843.68
Platform1_FCA    D                   40.2707             45.0000  6283676400.00    590401.90
Platform2_FCB  M                   28.7442             30.5860   258157515.00    344716.76
Platform2_FCB  I                   44.4728             45.0000   258157515.00      9216.35
Platform2_FCB  D                   41.1310             45.0000   258157515.00     19896.01
Platform3_FCC     M                   22.9983             23.2025  2510670817.00  12588158.43
Platform3_FCC     I                   44.0934             45.0000  2510670817.00     97824.77
Platform3_FCC     D                   41.1932             45.0000  2510670817.00    190753.01

This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields. I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique. Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file. I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is. I can find no errors or warnings about this in the output or stdout/stderr.

I'm wondering what would happen when I add a second sample, with the same PU IDs?

Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups? I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse.

Many thanks Anna

Comments (12)


I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints?


Read Groups in BAM file:

@RG     ID:41852        PL:illumina     PU:41852        LB:nolib        SM:41852p
@RG     ID:41852.1      PL:illumina     PU:41852        LB:nolib        SM:41852s
@RG     ID:41853        PL:illumina     PU:41853        LB:nolib        SM:41853s
@RG     ID:41854        PL:illumina     PU:41854        LB:nolib        SM:41854p
@RG     ID:41854.4      PL:illumina     PU:41854        LB:nolib        SM:41854s
@RG     ID:41855        PL:illumina     PU:41855        LB:nolib        SM:41855p
@RG     ID:41855.6      PL:illumina     PU:41855        LB:nolib        SM:41855s

Variants call in VCF file

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  41852p  41852s  41853s  41854p  41854s  41855p  41855s

chrM    150     .       T       C       41679.01        PASS    ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04    GT:AD:DP:GQ:PL  1/1:0,200:200:99:6075,544,0     1/1:0,200:200:99:6315,568,0     1/1:0,200:200:99:7094,599,0     1/1:1,197:198:99:6820,547,0     1/1:0,113:113:99:3624,322,0     1/1:0,200:200:99:7111,599,0     1/1:0,141:141:99:4640,403,0



Comments (1)

Hi, I have two bam files from one case and one control. They were both mapped with bwa. In their bam files, their readgroup tags are with different sample ID (SM), but with the same library ID (LB).

I then fed the two bam files into GATK together for realignment, and the output is a merged bam for the two samples. I am just a bit unsure, if the two samples with the same LB id will affect the downstream variant calls? can the caller can distinguish them by their different sample IDs?

Thank you very much for your kind guidance!

Comments (4)

hello every one please i got this error while running this command

elendin@elendin-HP-Pavilion-dv6700-Notebook-PC:~/analysis of rnaseq bamfiles$ java -jar GenomeAnalysisTK.jar -R VitisVinifera.fasta -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal wt.bam -I:tumor newmut.bam

MESSAGE: SAM/BAM file SAMFileReader{/home/elendin/analysis of rnaseq bamfiles/newmut.bam} is malformed: Read HWI-ST132_0461:3:2201:1211:140854#GTCCTA is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK. Please use http://www.broadinstitute.org/gsa/wiki/index.php/ReplaceReadGroups to fix this problem

ERROR ------------------------------------------------------------------------------------------

how can i add the missing read group for HWI-ST132_0461:3:2201:1211:140854#GTCCTA or define it in the header. thanks

Comments (3)


I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

Samtools seems to run properly but no file (*.vcf and *.bcf) are created and no error message is prompted :

cd Sample_09 + samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50 + bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670 + bcftools view .bcf

+ cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.

Thanks in advance for any advice you can give me.


Comments (1)

ERROR MESSAGE: SAM/BAM file genome_110616_SN365_A_s_7_seq_GQJ-1.pe.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

i am very new to GATK and i was trying to invoke the readcount command and i got the error above what is read groups.

thank you