Tagged with #reordersam
2 documentation articles | 0 announcements | 1 forum discussion

Created 2012-08-11 06:46:51 | Updated 2015-05-26 17:35:16 | Tags: vcf bam script reordersam sorting
Comments (9)

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order.

So what do you do?

For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam I=original.bam R=reference.fasta O=reordered.bam

Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file.

For VCF files

You run this handy little script on the VCF to re-sort it to match the reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

    print "\nUsage:\n";
    print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

    print " Sorts lines of the input file INFILE according\n";
    print " to the reference contig order specified by the\n";
    print " reference dictionary REF_DICT (.fai file).\n";
    print " The sort is stable. If -k option is not specified,\n";
    print " it is assumed that the contig name is the first\n";
    print " field in each line.\n\n";
    print "  INPUT      input file to sort. If '-' is specified, \n";
    print "             then reads from STDIN.\n";
    print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
    print "             desired sorting order, as its first column.\n";
    print "  --k POS :  contig name is in the field POS (1-based)\n";
    print "             of input lines.\n\n";


my $pos = 1;
GetOptions( "k:i" => \$pos );


usage() if ( scalar(@ARGV) == 0 );

if ( scalar(@ARGV) != 2 ) {
    print "Wrong number of arguments\n";

my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];

open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");

my %ref_order;

my $n = 0;
while ( <DICT> ) {
    my ($contig, $rest) = split "\t";
    die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );

    $ref_order{$contig} = $n;

close DICT;
#we have loaded contig ordering now

my $INPUT;
if ($input_file eq "-" ) {
    $INPUT = "STDIN";
} else {
    open($INPUT, "< $input_file") or die("Can not open $input_file: $!");

my %temp_outputs;

while ( <$INPUT> ) {

    my @fields = split '\s';
    die("Specified field position exceeds the number of fields:\n$_") 
        if ( $pos >= scalar(@fields) );

    my $contig = $fields[$pos];
    if ( $contig =~ m/:/ ) {
        my @loc = split(/:/, $contig);
        # print $contig . " " . $loc[0] . "\n";
        $contig = $loc[0]
    chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line

    my $order;
    if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
    else {
        $order = $n; # input line has contig that was not in the dict; 
        $n++; # this contig will go at the end of the output, 
              # after all known contigs

    my $fhandle;
    if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
    else {
        #print "opening $order $$ $_\n";
        open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
            die ( "Can not open temporary file $order: $!");
        $temp_outputs{$order} = $fhandle;

    # we got the handle to the temp file that keeps all 
    # lines with contig $contig

    print $fhandle $_; # send current line to its corresponding temp file

close $INPUT;

foreach my $f ( values %temp_outputs ) { close $f; }

# now collect back into single output stream:

for ( my $i = 0 ; $i < $n ; $i++ ) {
    # if we did not have any lines on contig $i, then there's 
    # no temp file and nothing to do
    next if ( ! defined $temp_outputs{$i} ) ; 

    my $f; 
    open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
    while ( <$f> ) { print ; }
    close $f;

    unlink "/tmp/sortByRef.$$.$i.tmp";


Created 2012-07-23 18:02:24 | Updated 2015-05-15 09:13:41 | Tags: bam utilities picard reordersam sorting order
Comments (17)

This error occurs when for example, a collaborator gives you a BAM that's derived from what was originally the same reference as you are using, but for whatever reason the contigs are not sorted in the same order .The GATK can be particular about the ordering of a BAM file so it will fail with an error in this case.

So what do you do? You use a Picard tool called ReorderSam to, well, reorder your BAM file.

Here's an example usage where we reorder a BAM file that was sorted lexicographically so that the output will be another BAM, but this time sorted karyotypically :

java -jar picard.jar ReorderSam \
    I= lexicographic.bam \
    O= kayrotypic.bam \
    REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta

This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!

This tool is part of the Picard package.

No posts found with the requested search criteria.

Created 2014-04-17 19:41:42 | Updated 2014-04-17 19:42:38 | Tags: indelrealigner reference-error indel-vcf-gatk reordersam
Comments (6)

Hi. I'm unable to use the IndelRealigner java jar. My previous steps were;

1) Convert Bowtie2 paired-end Illumina Reads .sam to .bam

2) Use bedtools to extract pairs that fall within the Hg19 exome.

3) Convert the new .bam to .sam

4) Sort the new .sam via SortSam.jar

5) Mark duplicates via MarkDuplicates.jar

6) Use AddOrReplaceReadGroups.jar

7 ) Use ReorderSam.jar

8) Use RealignerTargetCreator

Untill this far everything went well. Now I'm trying the following command; java -jar GenomeAnalysisTK.jar -T IndelRealigner -R [.fasta] -l [ReorderedSam.bam] -targetIntervals [aligner.intervals] -o output.bam (Also when applying -known and an .vcf file Im producing the same error):

ERROR MESSAGE: Unable to match: GATK_7-PicardReorderSam.bam to a logging level, make sure it's a valid level (DEBUG, INFO, WARN, ERROR, FATAL, OFF)

I hope you can help me, because I can't find anything related on google.