
Objective

Fix a BAM file that is not sorted, is not indexed, has not had duplicates marked, or lacks read group information. These steps can be performed independently of each other, but this order is recommended.

Prerequisites

  • Installed Picard tools

Steps

  1. Sort the aligned reads by coordinate order
  2. Mark duplicates
  3. Add read group information
  4. Index the BAM file

Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.


1. Sort the aligned reads by coordinate order

Action

Run the following Picard command:

java -jar SortSam.jar \
    INPUT=unsorted_reads.bam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.
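You can confirm the result by inspecting the BAM header: a coordinate-sorted BAM carries `SO:coordinate` on its `@HD` line. A minimal sketch of that check (the `samtools` invocation and the sample header line below are illustrative, not part of this article):

```shell
# check_sort_order reads a SAM @HD header line on stdin and prints the
# value of its SO: field (e.g. "coordinate" or "queryname").
check_sort_order() {
    grep -o 'SO:[a-z]*' | cut -d: -f2
}

# With samtools installed, you would feed it the real header:
#   samtools view -H sorted_reads.bam | head -1 | check_sort_order
# Demonstration on a sample @HD line:
printf '@HD\tVN:1.5\tSO:coordinate\n' | check_sort_order
```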


2. Mark duplicate reads

Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt

Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such.

More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.
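Concretely, marking (rather than removing) a duplicate means setting the 0x400 (decimal 1024) bit in the read's SAM FLAG field, so the record stays in the file and downstream tools simply skip it. A sketch of counting marked duplicates (the `samtools` invocation and the sample records are illustrative):

```shell
# count_duplicates tallies non-header SAM lines whose FLAG field (column 2)
# has the duplicate bit 0x400 (1024) set. Uses integer division instead of
# bitwise operators for portability across awk implementations.
count_duplicates() {
    awk '!/^@/ { if (int($2 / 1024) % 2 == 1) n++ } END { print n + 0 }'
}

# With samtools installed, you would pipe in the real alignments:
#   samtools view dedup_reads.bam | count_duplicates
# Demonstration on three minimal records (FLAGs 0, 1040, 1024; the
# last two have the duplicate bit set):
printf 'r1\t0\nr2\t1040\nr3\t1024\n' | count_duplicates
```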


3. Add read group information

Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar \
    INPUT=dedup_reads.bam \
    OUTPUT=addrg_reads.bam \
    RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1

Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.
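The read group itself is recorded as an `@RG` line in the BAM header, and each read points at it via an `RG:Z` tag. A sketch of pulling the sample name back out of that header line (the `samtools` invocation is illustrative; the sample `@RG` line matches the parameters used above):

```shell
# get_rg_sample extracts the SM (sample) field from an @RG header line
# read on stdin.
get_rg_sample() {
    grep '^@RG' | grep -o 'SM:[^[:space:]]*' | cut -d: -f2
}

# With samtools installed, you would inspect the real header:
#   samtools view -H addrg_reads.bam | get_rg_sample
# Demonstration on the @RG line the command above would produce:
printf '@RG\tID:group1\tLB:lib1\tPL:illumina\tPU:unit1\tSM:sample1\n' | get_rg_sample
```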


4. Index the BAM file

Action

Run the following Picard command:

java -jar BuildBamIndex.jar \
    INPUT=addrg_reads.bam

Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which always output indexed files), it is best to save the indexing step for last.

Comments (3)

Hello there,

I am trying to use GATK to annotate an already generated VCF file, but I am running into this error message:

INFO 12:49:29,316 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,317 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-11-g3b2fab9, Compiled 2012/06/05 21:00:10
INFO 12:49:29,318 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 12:49:29,318 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 12:49:29,318 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 12:49:29,318 HelpFormatter - Program Args: -R reference/gp140.fa -V data/vcf/GP140-10A_S19.vcf -T VariantAnnotator -A DepthPerAlleleBySample -o annotated.vcf
INFO 12:49:29,319 HelpFormatter - Date/Time: 2014/03/04 12:49:29
INFO 12:49:29,319 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,319 HelpFormatter - ---------------------------------------------------------------------------------
INFO 12:49:29,325 RodBindingArgumentTypeDescriptor - Dynamically determined type of data/vcf/GP140-10A_S19.vcf to be VCF
INFO 12:49:29,333 GenomeAnalysisEngine - Strictness is SILENT
WARN 12:49:29,341 FSLockWithShared - WARNING: Unable to lock file /share/lustre/raniba/customgenome/reference/gp140.dict: Function not implemented.
INFO 12:49:29,342 ReferenceDataSource - Unable to create a lock on dictionary file: Function not implemented
INFO 12:49:29,342 ReferenceDataSource - Treating existing dictionary file as complete.
WARN 12:49:29,344 FSLockWithShared - WARNING: Unable to lock file /share/lustre/raniba/customgenome/reference/gp140.fa.fai: Function not implemented.
INFO 12:49:29,344 ReferenceDataSource - Unable to create a lock on index file: Function not implemented
INFO 12:49:29,344 ReferenceDataSource - Treating existing index file as complete.
INFO 12:49:29,963 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 1.6-11-g3b2fab9):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: Invalid command line: Failed to load reference dictionary
ERROR ------------------------------------------------------------------------------------------

Apparently it is related to a corrupted .dict file. I am using a custom genome. After indexing the reference with bwa, I found that the .dict file was empty; all the other files are OK.

Could the size of the genome be a limiting factor in indexing? (It is a small custom genome file.)

Any idea ?
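For context: the index files bwa writes (.amb, .ann, .bwt, .pac, .sa) never include a .dict; the sequence dictionary is generated separately by Picard's CreateSequenceDictionary. A sketch of regenerating and sanity-checking it (paths taken from the log above; the sample dictionary content is illustrative):

```shell
# The .dict is produced by Picard, not bwa, e.g.:
#   java -jar CreateSequenceDictionary.jar R=reference/gp140.fa O=reference/gp140.dict
# A valid dictionary is a SAM-style header: an @HD line followed by one @SQ
# line per contig, so an empty .dict will always fail to load.

# dict_ok prints "valid" if stdin is non-empty and begins with @HD,
# otherwise "invalid".
dict_ok() {
    if head -1 | grep -q '^@HD'; then echo valid; else echo invalid; fi
}

# Demonstration on a minimal dictionary (contig name from the log above;
# the length is made up):
printf '@HD\tVN:1.5\n@SQ\tSN:gp140\tLN:2000\n' | dict_ok
```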

Comments (1)

Hi,

Is there any way to skip "indexing" the output VCF file when running VariantFiltration? It is taking up too much of our pipeline's time, and it is unnecessary since we parse the VCF file ourselves and don't need the index.

Thanks, - mp