Tagged with #readgroup
3 documentation articles | 0 announcements | 7 forum discussions

Created 2015-11-20 19:22:28 | Updated 2015-11-24 21:37:41 | Tags: readgroup addorreplacereadgroups
Comments (0)

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world. Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model. There is another read group tag called PU (referring to the platform unit, holding flowcell lane information) that is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

  • PL = Platform/technology used to produce the read This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields.

Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names. We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster.


Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:

Mom's data:

Kid's data:

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

Created 2013-07-03 00:53:53 | Updated 2015-09-24 12:57:53 | Tags: analyst readgroup picard indexing
Comments (4)


Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.


  • Installed Picard tools


  1. Sort the aligned reads by coordinate order
  2. Mark duplicates
  3. Add read group information
  4. Index the BAM file


You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

1. Sort the aligned reads by coordinate order


Run the following Picard command:

java -jar picard.jar SortSam \ 
    INPUT=unsorted_reads.bam \ 
    OUTPUT=sorted_reads.bam \ 

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

2. Mark duplicate reads


Run the following Picard command:

java -jar picard.jar MarkDuplicates \ 
    INPUT=sorted_reads.bam \ 
    OUTPUT=dedup_reads.bam \

Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

3. Add read group information


Run the following Picard command:

java -jar picard.jar AddOrReplaceReadGroups \ 
    INPUT=dedup_reads.bam \ 
    OUTPUT=addrg_reads.bam \ 
    RGID=group1 RGLB= lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 

Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

4. Index the BAM file


Run the following Picard command:

java -jar picard.jar BuildBamIndex \ 

Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.

Created 2012-07-23 18:04:20 | Updated 2015-11-20 19:40:12 | Tags: bam readgroup utilities addorreplacereadgroups
Comments (0)

What are read groups?

See the Dictionary entry on read groups.

Errors about missing or undefined read groups

As detailed in the FAQs about input requirements, GATK expects all read groups appearing in the read data to be specified in the file header, and will fail with an error if it does not find that information (whether there is no read group information in the file, or a subset of reads do not have read groups).


You can use a Picard tool called AddOrReplaceReadGroups to add the missing information to your input file.

Here's an example:

# throws an error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_without_RG.bam \
    -o output.vcf

# fix the read groups
java -jar picard.jar AddOrReplaceReadGroups \
    I= reads_without_RG.bam \
    O=  reads_with_RG.bam \
    SORT_ORDER=coordinate \
    RGID=foo \
    RGLB=bar \
    RGPL=illumina \
    RGSM=Sample1 \

# runs without error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_with_RG.bam \
    -o output.vcf

Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need.

No posts found with the requested search criteria.

Created 2014-09-08 11:15:00 | Updated 2014-09-08 11:16:08 | Tags: baserecalibrator readgroup header rg pu lb
Comments (4)

Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation.
I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this:

@RG ID:2_1  PL:ILLUMINA PU:Platform1_FCA    LB:Sample2  DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1
@RG ID:2_2  PL:ILLUMINA PU:Platform2_FCB    LB:Sample2  DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2
@RG ID:2_3  PL:ILLUMINA PU:Platform3_FCC    LB:Sample2  DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3

(IDs changed to something more generic)

I have run BaseRecalibrator, and the read group results are broken down by the PU field:

ReadGroup                    EventType  EmpiricalQuality  EstimatedQReported  Observations   Errors     
Platform1_FCA    M                   28.2433             28.1998  6283676400.00   9416437.03
Platform1_FCA    I                   41.2350             45.0000  6283676400.00    472843.68
Platform1_FCA    D                   40.2707             45.0000  6283676400.00    590401.90
Platform2_FCB  M                   28.7442             30.5860   258157515.00    344716.76
Platform2_FCB  I                   44.4728             45.0000   258157515.00      9216.35
Platform2_FCB  D                   41.1310             45.0000   258157515.00     19896.01
Platform3_FCC     M                   22.9983             23.2025  2510670817.00  12588158.43
Platform3_FCC     I                   44.0934             45.0000  2510670817.00     97824.77
Platform3_FCC     D                   41.1932             45.0000  2510670817.00    190753.01

This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields. I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique. Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file. I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is. I can find no errors or warnings about this in the output or stdout/stderr.

I'm wondering what would happen when I add a second sample, with the same PU IDs?

Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups? I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse.

Many thanks Anna

Created 2013-01-14 11:27:11 | Updated 2013-01-14 14:32:46 | Tags: readgroup
Comments (12)


I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints?


Read Groups in BAM file:

@RG     ID:41852        PL:illumina     PU:41852        LB:nolib        SM:41852p
@RG     ID:41852.1      PL:illumina     PU:41852        LB:nolib        SM:41852s
@RG     ID:41853        PL:illumina     PU:41853        LB:nolib        SM:41853s
@RG     ID:41854        PL:illumina     PU:41854        LB:nolib        SM:41854p
@RG     ID:41854.4      PL:illumina     PU:41854        LB:nolib        SM:41854s
@RG     ID:41855        PL:illumina     PU:41855        LB:nolib        SM:41855p
@RG     ID:41855.6      PL:illumina     PU:41855        LB:nolib        SM:41855s

Variants call in VCF file

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  41852p  41852s  41853s  41854p  41854s  41855p  41855s

chrM    150     .       T       C       41679.01        PASS    ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04    GT:AD:DP:GQ:PL  1/1:0,200:200:99:6075,544,0     1/1:0,200:200:99:6315,568,0     1/1:0,200:200:99:7094,599,0     1/1:1,197:198:99:6820,547,0     1/1:0,113:113:99:3624,322,0     1/1:0,200:200:99:7111,599,0     1/1:0,141:141:99:4640,403,0



Created 2012-12-24 20:09:25 | Updated 2013-01-07 19:16:08 | Tags: bam readgroup picard
Comments (3)

how to add read group to the bam file using PICARD generated from GS reference mapper BAM file?

Created 2012-09-06 20:17:20 | Updated 2012-09-07 01:55:51 | Tags: readgroup
Comments (1)

Hi, I have two bam files from one case and one control. They were both mapped with bwa. In their bam files, their readgroup tags are with different sample ID (SM), but with the same library ID (LB).

I then fed the two bam files into GATK together for realignment, and the output is a merged bam for the two samples. I am just a bit unsure, if the two samples with the same LB id will affect the downstream variant calls? can the caller can distinguish them by their different sample IDs?

Thank you very much for your kind guidance!

Created 2012-09-04 22:27:10 | Updated 2013-01-07 20:41:31 | Tags: somaticindeldetector readgroup
Comments (4)

hello every one please i got this error while running this command

elendin@elendin-HP-Pavilion-dv6700-Notebook-PC:~/analysis of rnaseq bamfiles$ java -jar GenomeAnalysisTK.jar -R VitisVinifera.fasta -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal wt.bam -I:tumor newmut.bam

MESSAGE: SAM/BAM file SAMFileReader{/home/elendin/analysis of rnaseq bamfiles/newmut.bam} is malformed: Read HWI-ST132_0461:3:2201:1211:140854#GTCCTA is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK. Please use http://www.broadinstitute.org/gsa/wiki/index.php/ReplaceReadGroups to fix this problem

ERROR ------------------------------------------------------------------------------------------

how can i add the missing read group for HWI-ST132_0461:3:2201:1211:140854#GTCCTA or define it in the header. thanks

Created 2012-08-29 10:02:02 | Updated 2013-01-07 20:42:47 | Tags: best-practices readgroup realignment baserecalibration
Comments (3)


I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

Samtools seems to run properly but no file (*.vcf and *.bcf) are created and no error message is prompted :

cd Sample_09 + samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50 + bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670 + bcftools view .bcf

+ cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.

Thanks in advance for any advice you can give me.


Created 2012-08-25 06:24:57 | Updated 2013-01-07 20:45:01 | Tags: readgroup error
Comments (1)

ERROR MESSAGE: SAM/BAM file genome_110616_SN365_A_s_7_seq_GQJ-1.pe.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

i am very new to GATK and i was trying to invoke the readcount command and i got the error above what is read groups.

thank you