Tagged with #indexing
1 documentation article | 0 announcements | 2 forum discussions

Created 2013-07-03 00:53:53 | Updated 2015-11-25 21:09:04 | Tags: analyst readgroup picard indexing
Comments (5)

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. The options on this page are listed in order of decreasing complexity.

You may ask, is all of this really necessary? The GATK imposes strict formatting guidelines, including certain read group information, that other software packages do not require. Although this adds a small processing burden upfront, the downstream benefits are numerous, including the ability to process each library's data individually and significant gains in speed and parallelization options.


Prerequisites

  • Installed Picard tools
  • If indexing or marking duplicates, coordinate-sorted reads
  • If coordinate sorting, reference-aligned reads
  • A single BAM file per read group ID. If you have a multiplexed file, separate it into individual files per sample.

Jump to sections on this page

  1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups
  2. Coordinate sort and index using SortSam
  3. Index an already coordinate-sorted BAM using BuildBamIndex
  4. Mark duplicates using MarkDuplicates

Tools involved

  • Picard's AddOrReplaceReadGroups, SortSam, BuildBamIndex and MarkDuplicates

Related resources

  • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure that your read group fields conform to the recommendations.
  • Picard's standard options include CREATE_INDEX, which can be added to the command of any of its tools that produce coordinate-sorted BAMs.

1. Add read groups, coordinate sort and index using AddOrReplaceReadGroups

Use Picard's AddOrReplaceReadGroups to appropriately label read group (@RG) fields, coordinate sort and index a BAM file. Only the five required @RG fields are included in the command shown. Consider the other optional @RG fields for better record keeping.

java -jar picard.jar AddOrReplaceReadGroups \
    INPUT=reads.bam \
    OUTPUT=reads_addRG.bam \
    RGID=H0164.2 \
    RGLB=library1 \
    RGPL=illumina \
    RGPU=H0164ALXX140820.2 \
    RGSM=sample1

Be sure to change RGID from its default value of 1.

This creates a file called reads_addRG.bam with the same content and sorting as the input file, except the SAM record header's @RG line will be updated with the new information for the specified fields and each read will now have an RG tag filled with the @RG ID field information. Because of this repetition, the length of the @RG ID field contributes to file size.
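As an illustration of what the updated header carries, the @RG line is a TAB-separated list of TAG:VALUE pairs. A small shell sketch; the header text below is a hypothetical line matching the command above, not read from a real BAM:

```shell
# Hypothetical @RG header line as the command above would write it
rg_line=$'@RG\tID:H0164.2\tLB:library1\tPL:illumina\tPU:H0164ALXX140820.2\tSM:sample1'

# Each @RG field is a TAB-separated TAG:VALUE pair; pull out the ID field
rg_id=$(printf '%s\n' "$rg_line" | tr '\t' '\n' | grep '^ID:' | cut -d: -f2-)
echo "$rg_id"   # prints: H0164.2
```

On a real file, the same line can be inspected with `samtools view -H reads_addRG.bam | grep '^@RG'`.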

To additionally coordinate sort by genomic location and create a .bai index, add the following parameters to the command.

    SORT_ORDER=coordinate \
    CREATE_INDEX=true

2. Coordinate sort and index using SortSam

Picard's SortSam sorts reads, converts between SAM and BAM formats, and can optionally index the output. For coordinate sorting, reads must be aligned to a reference genome.

java -jar picard.jar SortSam \
    INPUT=reads.bam \
    OUTPUT=reads_sorted.bam \
    SORT_ORDER=coordinate

Concurrently index by tacking on the following optional parameter.

    CREATE_INDEX=true

This creates a file called reads_sorted.bam containing reads sorted by genomic location, aka coordinate, and a .bai index file with the same prefix as the output, e.g. reads_sorted.bai, within the same directory.

3. Index an already coordinate-sorted BAM using BuildBamIndex

Picard's BuildBamIndex allows you to index a BAM that is already coordinate sorted.

java -jar picard.jar BuildBamIndex \
    INPUT=reads_sorted.bam

This creates a .bai index file with the same prefix as the input file, e.g. reads_sorted.bai, within the same directory as the input file. You want to keep this default behavior as many tools require the same prefix and directory location for the pair of files. Note that Picard tools do not systematically create an index file when they output a new BAM file, whereas GATK tools will always output indexed files.
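The naming convention can be sketched in plain shell: the index name is the BAM path with its .bam extension swapped for .bai (file names below reuse the example above):

```shell
# Derive the default .bai name that sits next to a coordinate-sorted BAM
bam=reads_sorted.bam
bai="${bam%.bam}.bai"   # strip the .bam suffix, append .bai; directory is unchanged
echo "$bai"             # prints: reads_sorted.bai
```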

4. Mark duplicates using MarkDuplicates

Picard's MarkDuplicates flags both PCR and optical duplicate reads with a 1024 (0x400) SAM flag. The input BAM must be coordinate sorted.

java -jar picard.jar MarkDuplicates \
    INPUT=reads_sorted.bam \
    OUTPUT=reads_markdup.bam \
    METRICS_FILE=metrics.txt \
    CREATE_INDEX=true

This creates a file called reads_markdup.bam with duplicate reads marked. It also creates a file called metrics.txt containing duplicate read data metrics and a .bai index file.

  • Marking duplicates is more relevant to DNA variant discovery than to RNA-Seq analysis. For historic reasons, GATK tools require duplicates to be marked regardless of downstream applicability.
  • PCR amplification during library preparation and cluster generation means that, by a stochastic process, we end up sequencing several reads that arose from the same parent insert. To be stringent in variant discovery, GATK tools discount duplicate reads as evidence for or against a putative variant.
  • Optical duplicates arise when a single cluster is read twice as neighboring clusters.
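Because the duplicate mark is a single bit (0x400, decimal 1024) of the SAM FLAG field, a marked read can be recognized with a bitwise AND. A minimal sketch with a made-up FLAG value:

```shell
# Hypothetical FLAG of a marked duplicate: 0x400 (duplicate) + 0x40 + 0x10 + 0x2 + 0x1 = 1107
flag=1107
if (( flag & 0x400 )); then
    echo "duplicate"        # this branch runs: the 0x400 bit is set
else
    echo "not a duplicate"
fi
```

The same FLAG value minus the duplicate bit (83) would take the other branch.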



Created 2014-03-04 20:58:01 | Updated | Tags: indexing custom-genome
Comments (3)

Hello there,

I am trying to use GATK to annotate an already generated VCF file, but I am running into this error message:

INFO 12:49:29,316 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,317 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-11-g3b2fab9, Compiled 2012/06/05 21:00:10
INFO 12:49:29,318 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 12:49:29,318 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 12:49:29,318 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 12:49:29,318 HelpFormatter - Program Args: -R reference/gp140.fa -V data/vcf/GP140-10A_S19.vcf -T VariantAnnotator -A DepthPerAlleleBySample -o annotated.vcf
INFO 12:49:29,319 HelpFormatter - Date/Time: 2014/03/04 12:49:29
INFO 12:49:29,319 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,319 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,325 RodBindingArgumentTypeDescriptor - Dynamically determined type of data/vcf/GP140-10A_S19.vcf to be VCF
INFO 12:49:29,333 GenomeAnalysisEngine - Strictness is SILENT
WARN 12:49:29,341 FSLockWithShared - WARNING: Unable to lock file /share/lustre/raniba/customgenome/reference/gp140.dict: Function not implemented.
INFO 12:49:29,342 ReferenceDataSource - Unable to create a lock on dictionary file: Function not implemented
INFO 12:49:29,342 ReferenceDataSource - Treating existing dictionary file as complete.
WARN 12:49:29,344 FSLockWithShared - WARNING: Unable to lock file /share/lustre/raniba/customgenome/reference/gp140.fa.fai: Function not implemented.
INFO 12:49:29,344 ReferenceDataSource - Unable to create a lock on index file: Function not implemented
INFO 12:49:29,344 ReferenceDataSource - Treating existing index file as complete.
INFO 12:49:29,963 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 1.6-11-g3b2fab9):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR MESSAGE: Invalid command line: Failed to load reference dictionary
ERROR ------------------------------------------------------------------------------------------

Apparently it is related to a corrupted .dict file. I am using a custom genome. After indexing the reference with bwa, I found that the .dict file was empty; all the other files are OK.

Could the size of the genome be a limitation for indexing? (It is a small custom genome file.)

Any ideas?

Created 2013-05-27 02:26:43 | Updated | Tags: variantfiltrationwalker indexing
Comments (1)


Is there any way to skip indexing the output VCF file when running VariantFiltration? It takes up too much time in our pipeline, and it is unnecessary since we parse the VCF ourselves and don't need the index file.

Thanks, - mp