# Tagged with #readgroup 4 documentation articles | 0 announcements | 8 forum discussions

Created 2015-11-23 21:11:51 | Updated 2015-12-10 21:53:28 | Tags: bam readgroup fastq pre-processing ubam

Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

(A) Convert FASTQ to uBAM and add read group information using FastqToSam (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

#### Prerequisites

• Installed Picard tools

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

#### Related resources

• Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.
• This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
• To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
• In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSAM.

### (A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar /path/picard.jar FastqToSam \
FASTQ=6484_snippet_XT_interleaved.fq \ #our single tutorial file contains both reads in a pair
OUTPUT=6484_snippet_fastqtosam.bam \
READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
SAMPLE_NAME=NA12878 \ #required
LIBRARY_NAME=Solexa-272222 \ #required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ #recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

• QUALITY_FORMAT is detected automatically if unspecified.
• SORT_ORDER by default is queryname.
• Specify both FASTQ and FASTQ2 for paired reads in separate files.
• PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
• RUN_DATE is in Iso8601 date format.

### (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to cater the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6484_snippet.bam \
O=6484_snippet_revertsam.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM

• We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[1]++) { print }}'  For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags. • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval. • In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing. Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  Some comments on options kept at default: • SORT_ORDER=queryname For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed. • RESTORE_ORIGINAL_QUALITIES=true Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag. • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false. Below we show below a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced. For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads. Original BAM H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339 CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:0 MQ:i:0 OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA UQ:i:0 AS:i:151 H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:2 MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### UQ:i:49 AS:i:50 After RevertSam H0164ALXX140820:2:1101:10003:23460 77 * 0 0 * * 0 0 TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJ H0164ALXX140820:2:1101:10003:23460 141 * 0 0 * * 0 0 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2 back to top Created 2015-11-20 19:22:28 | Updated 2015-11-25 18:15:15 | Tags: readgroup addorreplacereadgroups There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument. In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group. Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups. To see the read group information for a BAM file, use the following command. samtools view -H sample.bam | grep '@RG' This prints the lines starting with @RG within the header, e.g. as shown in the example below. @RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI ### Meaning of the read group fields required by GATK • ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world. Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model. There is another read group tag called PU (referring to the platform unit, holding flowcell lane information) that is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects. • SM = Sample The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. • PL = Platform/technology used to produce the read This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. • LB = DNA preparation library identifier MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here. ### Deriving ID and PU fields from read names Here we illustrate how to derive both ID and PU fields from read names. We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster. H0164ALXX140820:2:1101:10003:23460 H0164ALXX140820:2:1101:15118:25288 Breaking down the common portion of the query names: H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell _____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run _______________:2 #portion of @RG ID and PU fields indicating flow cell lane ### Multi-sample and multiplexed example Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header: Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library). Created 2013-07-03 00:53:53 | Updated 2015-12-04 18:55:56 | Tags: readgroup picard indexing Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. The options on this page are listed in order of decreasing complexity. You may ask, is all of this really necessary? The GATK imposes strict formatting guidelines, including requiring certain read group information, that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options. #### Prerequisites • Installed Picard tools • If indexing or marking duplicates, then coordinate sorted reads • If coordinate sorting, then reference aligned reads • For each read group ID, a single BAM file. If you have a multiplexed file, separate to individual files per sample. #### Jump to a section on this page 1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups 2. Coordinate sort and index using SortSam 3. Index an already coordinate-sorted BAM using BuildBamIndex 4. Mark duplicates using MarkDuplicates #### Tools involved #### Related resources • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure that your read group fields conform to the recommendations. • Picard's standard options includes adding CREATE_INDEX to the commands of any of its tools that produce coordinate sorted BAMs. ### 1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups Use Picard's AddOrReplaceReadGroups to appropriately label read group (@RG) fields, coordinate sort and index a BAM file. Only the five required @RG fields are included in the command shown. Consider the other optional @RG fields for better record keeping. java -jar picard.jar AddOrReplaceReadGroups \ INPUT=reads.bam \ OUTPUT=reads_addRG.bam \ RGID=H0164.2 \ #be sure to change from default of 1 RGLB= library1 \ RGPL=illumina \ RGPU=H0164ALXX140820.2 \ RGSM=sample1 \  This creates a file called reads_addRG.bam with the same content and sorting as the input file, except the SAM record header's @RG line will be updated with the new information for the specified fields and each read will now have an RG tag filled with the @RG ID field information. Because of this repetition, the length of the @RG ID field contributes to file size. To additionally coordinate sort by genomic location and create a .bai index, add the following options to the command.  SORT_ORDER=coordinate \ CREATE_INDEX=true To process large files, also designate a temporary directory.  TMP_DIR=/path/shlee #sets environmental variable for temporary directory ### 2. Coordinate sort and index using SortSam Picard's SortSam both sorts and indexes and converts between SAM and BAM formats. For coordinate sorting, reads must be aligned to a reference genome. java -jar picard.jar SortSam \ INPUT=reads.bam \ OUTPUT=reads_sorted.bam \ SORT_ORDER=coordinate \ Concurrently index by tacking on the following option.  CREATE_INDEX=true This creates a file called reads_sorted.bam containing reads sorted by genomic location, aka coordinate, and a .bai index file with the same prefix as the output, e.g. reads_sorted.bai, within the same directory. To process large files, also designate a temporary directory.  TMP_DIR=/path/shlee #sets environmental variable for temporary directory ### 3. Index an already coordinate-sorted BAM using BuildBamIndex Picard's BuildBamIndex allows you to index a BAM that is already coordinate sorted. java -jar picard.jar BuildBamIndex \ INPUT=reads_sorted.bam  This creates a .bai index file with the same prefix as the input file, e.g. reads_sorted.bai, within the same directory as the input file. You want to keep this default behavior as many tools require the same prefix and directory location for the pair of files. Note that Picard tools do not systematically create an index file when they output a new BAM file, whereas GATK tools will always output indexed files. To process large files, also designate a temporary directory.  TMP_DIR=/path/shlee #sets environmental variable for temporary directory ### 4. Mark duplicates using MarkDuplicates Picard's MarkDuplicates flags both PCR and optical duplicate reads with a 1024 (0x400) SAM flag. The input BAM must be coordinate sorted. java -jar picard.jar MarkDuplicates \ INPUT=reads_sorted.bam \ OUTPUT=reads_markdup.bam \ METRICS_FILE=metrics.txt \ CREATE_INDEX=true This creates a file called reads_markdup.bam with duplicate reads marked. It also creates a file called metrics.txt containing duplicate read data metrics and a .bai index file. To process large files, also designate a temporary directory.  TMP_DIR=/path/shlee #sets environmental variable for temporary directory • During sequencing, which involves PCR amplification within the sequencing machine, by a stochastic process we end up sequencing a proportion of DNA molecules that arose from the same parent insert. To be stringent in our variant discovery, GATK tools discount the duplicate reads as evidence for or against a putative variant. • Marking duplicates is less relevant to targeted amplicon sequencing and RNA-Seq analysis. • Optical duplicates arise from a read being sequenced twice as neighboring clusters. back to top Created 2012-07-23 18:04:20 | Updated 2016-01-04 21:27:56 | Tags: bam readgroup utilities addorreplacereadgroups ### What are read groups? See the Dictionary entry on read groups. ### Errors about missing or undefined read groups As detailed in the FAQs about input requirements, GATK expects all read groups appearing in the read data to be specified in the file header, and will fail with an error if it does not find that information (whether there is no read group information in the file, or a subset of reads do not have read groups). Typically you should read group information when you perform the original alignments (with e.g. BWA, which has an option to do so). So what do you do if you forgot to do that, and you don't want to have to rerun BWA all over again? ### Solution You can use a Picard tool called AddOrReplaceReadGroups to add the missing information to your input file. Here's an example: # throws an error java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fasta \ -I reads_without_RG.bam \ -o output.vcf # fix the read groups java -jar picard.jar AddOrReplaceReadGroups \ I= reads_without_RG.bam \ O= reads_with_RG.bam \ SORT_ORDER=coordinate \ RGID=foo \ RGLB=bar \ RGPL=illumina \ RGSM=Sample1 \ CREATE_INDEX=True # runs without error java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fasta \ -I reads_with_RG.bam \ -o output.vcf Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need. No posts found with the requested search criteria. Created 2016-01-12 04:29:35 | Updated | Tags: readgroup variant denovo replicate Hi, I am working on a project which requires me to identify Variants in a marine species using RNAseq data. I have assembled the transcriptome (using all available RNASeq data sets for the experiment) for my organism (no Reference is available). I then referred to the link 'http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference' and created the '.dict' file and indexed the reference. java -Xmx3g -jar /share/apps/picard/1.115/CreateSequenceDictionary.jar R=seaurchin_topHit_Ids.fasta O=seaurchin_topHit_Ids.dict samtools faidx seaurchin_topHit_Ids.fasta I have the following RNASeq data-sets available: Two growth Conditions : Control (C) vs Bubble (B) At two Growth Times: 8h and 48h RNASeq data is generated in replicates (R1 and R2) So in all I have the following read-sets C1-8h , C2-8h C1-48h, C2-48h B1-8h, B2-8h B1-48h, B2-48h The questions we might want to ask are : Identify variants : Q1)Across conditions : Condition C (Control) vs Condition B Q2)Across times : Growth times 8h vs 48h I am following the best practice guidelines for GATK for RNASeq data.. 1) Mapping individual data-sets using bwa (I will repeat the pipeline using STAR later, if I can get through all steps)..I have been using bwa for a while, so felt more comfortable using the tool. 2) sort the individual 'sam files' and convert into 'bam' 3) Mark Duplicates for individual bams to get the '.dedup.bam files' 4) Add Read groups I am keeping the RGSM tag common across datasets belonging to the same sample and separating the replicates by the RGID tag.. e.g. java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \ I= B1-BB-8h.dedup.bam \ O= B1-BB-8h.dedup_RG.bam \ SORT_ORDER=coordinate \ RGID=B1-BB-8h.dedup \ RGLB=BB8h \ RGPL=illumina \ RGSM=BB8h \ CREATE_INDEX=True \ RGPU=run_barcode java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \ I= B2-BB-8h.dedup.bam \ O= B2-BB-8h.dedup_RG.bam \ SORT_ORDER=coordinate \ RGID=B2-BB-8h.dedup \ RGLB=BB8h \ RGPL=illumina \ RGSM=BB8h \ CREATE_INDEX=True \ RGPU=run_barcode java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \ I= B1-BC-8h.dedup.bam \ O= B1-BC-8h.dedup_RG.bam \ SORT_ORDER=coordinate \ RGID=B1-BC-8h.dedup \ RGLB=BB8h \ RGPL=illumina \ RGSM=BC8h \ CREATE_INDEX=True \ RGPU=run_barcode Can someone confirm if the above steps look alright? I am a bit unsure as to what I do further!! Should I merge all dedup.bam files and do the downstream processing ? Since I have the read groups marked (associating samples but separating the conditions and growth times), can I expect to identify SNPs/indels etc for Q1 and Q2 ? I will appreciate if someone can guide me through this, regards, Nandan Created 2014-09-08 11:15:00 | Updated 2014-09-08 11:16:08 | Tags: baserecalibrator readgroup header rg pu lb Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation. I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this: @RG ID:2_1 PL:ILLUMINA PU:Platform1_FCA LB:Sample2 DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1 @RG ID:2_2 PL:ILLUMINA PU:Platform2_FCB LB:Sample2 DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2 @RG ID:2_3 PL:ILLUMINA PU:Platform3_FCC LB:Sample2 DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3 (IDs changed to something more generic) I have run BaseRecalibrator, and the read group results are broken down by the PU field: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors Platform1_FCA M 28.2433 28.1998 6283676400.00 9416437.03 Platform1_FCA I 41.2350 45.0000 6283676400.00 472843.68 Platform1_FCA D 40.2707 45.0000 6283676400.00 590401.90 Platform2_FCB M 28.7442 30.5860 258157515.00 344716.76 Platform2_FCB I 44.4728 45.0000 258157515.00 9216.35 Platform2_FCB D 41.1310 45.0000 258157515.00 19896.01 Platform3_FCC M 22.9983 23.2025 2510670817.00 12588158.43 Platform3_FCC I 44.0934 45.0000 2510670817.00 97824.77 Platform3_FCC D 41.1932 45.0000 2510670817.00 190753.01 This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields. I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique. Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file. I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is. I can find no errors or warnings about this in the output or stdout/stderr. I'm wondering what would happen when I add a second sample, with the same PU IDs? Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups? I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse. Many thanks Anna Created 2013-01-14 11:27:11 | Updated 2013-01-14 14:32:46 | Tags: readgroup Hi, I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints? Eg. Read Groups in BAM file: @RG ID:41852 PL:illumina PU:41852 LB:nolib SM:41852p @RG ID:41852.1 PL:illumina PU:41852 LB:nolib SM:41852s @RG ID:41853 PL:illumina PU:41853 LB:nolib SM:41853s @RG ID:41854 PL:illumina PU:41854 LB:nolib SM:41854p @RG ID:41854.4 PL:illumina PU:41854 LB:nolib SM:41854s @RG ID:41855 PL:illumina PU:41855 LB:nolib SM:41855p @RG ID:41855.6 PL:illumina PU:41855 LB:nolib SM:41855s Variants call in VCF file #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 41852p 41852s 41853s 41854p 41854s 41855p 41855s chrM 150 . T C 41679.01 PASS ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04 GT:AD:DP:GQ:PL 1/1:0,200:200:99:6075,544,0 1/1:0,200:200:99:6315,568,0 1/1:0,200:200:99:7094,599,0 1/1:1,197:198:99:6820,547,0 1/1:0,113:113:99:3624,322,0 1/1:0,200:200:99:7111,599,0 1/1:0,141:141:99:4640,403,0 Regards Gaurav Created 2012-12-24 20:09:25 | Updated 2013-01-07 19:16:08 | Tags: bam readgroup picard how to add read group to the bam file using PICARD generated from GS reference mapper BAM file? Created 2012-09-06 20:17:20 | Updated 2012-09-07 01:55:51 | Tags: readgroup Hi, I have two bam files from one case and one control. They were both mapped with bwa. In their bam files, their readgroup tags are with different sample ID (SM), but with the same library ID (LB). I then fed the two bam files into GATK together for realignment, and the output is a merged bam for the two samples. I am just a bit unsure, if the two samples with the same LB id will affect the downstream variant calls? can the caller can distinguish them by their different sample IDs? Thank you very much for your kind guidance! Created 2012-09-04 22:27:10 | Updated 2013-01-07 20:41:31 | Tags: somaticindeldetector readgroup hello every one please i got this error while running this command elendin@elendin-HP-Pavilion-dv6700-Notebook-PC:~/analysis of rnaseq bamfiles java -jar GenomeAnalysisTK.jar -R VitisVinifera.fasta -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal wt.bam -I:tumor newmut.bam

##### ERROR ------------------------------------------------------------------------------------------

how can i add the missing read group for HWI-ST132_0461:3:2201:1211:140854#GTCCTA or define it in the header. thanks

Created 2012-08-29 10:02:02 | Updated 2013-01-07 20:42:47 | Tags: best-practices readgroup realignment baserecalibration

Hello,

I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

## Samtools seems to run properly but no file (.vcf and .bcf) are created and no error message is prompted :

cd Sample_09

• samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50
• bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670
• bcftools view .bcf
• ## cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.