Tagged with #alignment
1 documentation article | 0 announcements | 0 forum discussions


Comments (4)

Sting BWA/C Bindings

WARNING: This tool is experimental and unsupported and just starting to be developed and used and should be considered a beta version. Feel free to report bugs but we are not supporting the tool

The GSA group has made bindings available for Heng Li's Burrows-Wheeler Aligner (BWA). Our aligner bindings present additional functionality to the user not traditionally available with BWA. BWA standalone is optimized to do fast, low-memory alignments from Fastq to BAM. While our bindings aim to provide support for reasonably fast, reasonably low memory alignment, we add the capacity to do exploratory data analyses. The bindings can provide all alignments for a given read, allowing a user to walk over the alignments and see information not typically provided in the BAM format. Users of the bindings can 'go deep', selectively relaxing alignment parameters one read at a time, looking for the best alignments at a site.

The BWA/C bindings should be thought of as alpha release quality. However, we aim to be particularly responsive to issues in the bindings as they arise. Because of the bindings' alpha state, some functionality is limited; see the Limitations section below for more details on what features are currently supported.

Contents

A note about using the bindings

Whenever native code is called from Java, the user must assist Java in finding the proper shared library. Java looks for shared libraries in two places, on the system-wide library search path and through Java properties invoked on the command line. To add libbwa.so to the global library search path, add the following to your .my.bashrc, .my.cshrc, or other startup file:

bash
export LD_LIBRARY_PATH=/humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH
csh
setenv LD_LIBRARY_PATH /humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH

To specify the location of libbwa.so directly on the command-line, use the java.library.path system property as follows:

java -Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
    -jar dist/GenomeAnalysisTK.jar \
    -T AlignmentValidation \
    -I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
    -R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta

Preparing to use the aligner

Within the Broad Institute

We provide internally accessible versions of both the BWA shared library and precomputed BWA indices for two commonly used human references at the Broad (Homo_sapiens_assembly18.fasta and human_b36_both.fasta). These files live in the following directory:

/humgen/gsa-scr1/GATK_Data/bwa/stable

Outside of the Broad Institute

Two steps are required in preparing to use the aligner: building the shared library and using BWA/C to generate an index of the reference sequence.

The Java bindings to the aligner are available through the Sting repository. A precompiled version of the bindings are available for Linux; these bindings are available in c/bwa/libbwa.so.1. To build the aligner from source:

  • Fetch the latest svn of BWA from SourceForge. Configure and build BWA.
sh autogen.sh
./configure
make
  • Download the latest version of Sting from our Github repository.
  • Customize the variables at the top one of the build scripts (c/bwa/build_linux.sh,c/bwa/build_mac.sh) based on your environment. Run the build script.

To build a reference sequence, use the BWA C executable directly:

bwa index -a bwtsw <your reference sequence>.fasta

Using the existing GATK alignment walkers

Two walkers are provided for end users of the GATK. The first of the stock walkers is Align, which can align an unmapped BAM file or realign a mapped BAM file.

java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
    -jar dist/GenomeAnalysisTK.jar \
    -T Align \
    -I NA12878_Pilot1_20.unmapped.bam \
    -R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta \
    -U \
    -ob human.unsorted.bam

Most of the available parameters here are standard GATK. -T specifies that the alignment analysis should be used; -I specifies the unmapped BAM file to align, and the -R specifies the reference to which to align. By default, this walker assumes that the bwa index support files will live alongside the reference. If these files are stored elsewhere, the optional -BWT argument can be used to specify their location. By defaults, alignments will be emitted to the console in SAM format. Alignments can be spooled to disk in SAM format using the -o option or spooled to disk in BAM format using the -ob option.

The other stock walker is AlignmentValidation, which computes all possible alignments based on the BWA default configuration settings and makes sure at least one of the top alignments matches the alignment stored in the read.

java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
    -jar dist/GenomeAnalysisTK.jar \
    -T AlignmentValidation \
    -I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
    -R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta

Options for the AlignmentValidation walker are identical to the Alignment walker, except the AlignmentValidation walker's only output is a exception if validation fails.

Another sample walker of limited scope, CountBestAlignmentsWalker, is available for review; it is discussed in the example section below.

Writing new GATK walkers utilizing alignment bindings

BWA/C can be created on-the-fly using the org.broadinstitute.sting.alignment.bwa.c.BWACAligner constructor. The bindings have two sets of interfaces: an interface which returns all possible alignments and an interface which randomly selects an alignment from a list of the top scoring alignments as selected by BWA.

To iterate through all functions, use the following method:

    /**
     * Get a iterator of alignments, batched by mapping quality.
     * @param bases List of bases.
     * @return Iterator to alignments.
     */
    public Iterable<Alignment[]> getAllAlignments(final byte[] bases);

The call will return an Iterable which batches alignments by score. Each call to next() on the provided iterator will return all Alignments of a given score, ordered in best to worst. For example, given a read sequence with at least one match on the genome, the first call to next() will supply all exact matches, and subsequent calls to next() will give alignments judged to be inferior by BWA (alignments containing mismatches, gap opens, or gap extensions).

Alignments can be transformed to reads using the following static method in org.broadinstitute.sting.alignment.Alignment:

    /**
     * Creates a read directly from an alignment.
     * @param alignment The alignment to convert to a read.
     * @param unmappedRead Source of the unmapped read.  Should have bases, quality scores, and flags.
     * @param newSAMHeader The new SAM header to use in creating this read.  Can be null, but if so, the sequence
     *                     dictionary in the
     * @return A mapped alignment.
     */
    public static SAMRecord convertToRead(Alignment alignment, SAMRecord unmappedRead, SAMFileHeader newSAMHeader);

A convenience method is available which allows the user to get SAMRecords directly from the aligner.

    /**
     * Get a iterator of aligned reads, batched by mapping quality.
     * @param read Read to align.
     * @param newHeader Optional new header to use when aligning the read.  If present, it must be null.
     * @return Iterator to alignments.
     */
    public Iterable<SAMRecord[]> alignAll(final SAMRecord read, final SAMFileHeader newHeader);

To return a single read randomly selected by the bindings, use one of the following methods:

    /**
     * Allow the aligner to choose one alignment randomly from the pile of best alignments.
     * @param bases Bases to align.
     * @return An align
     */
    public Alignment getBestAlignment(final byte[] bases);

    /**
     * Align the read to the reference.
     * @param read Read to align.
     * @param header Optional header to drop in place.
     * @return A list of the alignments.
     */
    public SAMRecord align(final SAMRecord read, final SAMFileHeader header);

The org.broadinstitute.sting.alignment.bwa.BWAConfiguration argument allows the user to specify parameters normally specified to 'bwt aln'. Available parameters are:

  • Maximum edit distance (-n)
  • Maximum gap opens (-o)
  • Maximum gap extensions (-e)
  • Disallow an indel within INT bp towards the ends (-i)
  • Mismatch penalty (-M)
  • Gap open penalty (-O)
  • Gap extension penalty (-E)

Settings must be supplied to the constructor; leaving any BWAConfiguration field unset means that BWA should use its default value for that argument. Configuration settings can be updated at any time using the BWACAligner updateConfiguration method.

    public void updateConfiguration(BWAConfiguration configuration);

Running the aligner outside of the GATK

The BWA/C bindings were written with running outside of the GATK in mind, but this workflow has never been tested. If you would like to run the bindings outside of the GATK, you will need:

  • The BWA shared object, libbwa.so.1
  • The packaged version of Aligner.jar

To build the packaged version of the aligner, run the following command

cp $STING_HOME/lib/bcel-*.jar ~/.ant/lib
ant package -Dexecutable=Aligner

This command will extract all classes required to run the aligner and place them in $STING_HOME/dist/packages/Aligner/Aligner.jar. You can then specify this one jar in your project's dependencies.

Limitations

The BWA/C bindings are currently in an alpha state, but they are extensively supported. Because of the bindings' alpha state, some functionality is limited. The limitations of these bindings include:

  • Only single-end alignment is supported. However, a paired end module could be implemented as a simple extension that finds the jointly optimal placement of both singly-aligned ends.
  • Color space alignments are not currently supported.
  • Only a limited number of parameters BWA's extensive parameter list are supported. The current list of supported parameters is specified in the 'Writing new GATK walkers utilizing alignment bindings' section below.
  • The system is not as heavily memory-optimized as the BWA/C implementation standalone. The JVM, by default, uses slightly over 4G of resident memory when running BWA on human. We have not done extensive testing on the behavior of the BWA/C bindings under memory pressure.
  • There is a slight negative impact on performance when using the BWA/C bindings. BWA/C standalone on 6.9M reads of human data takes roughly 45min to run 'bwa aln', 5min to run 'bwa samse', and another 1.5min to convert the resulting SAM file to a BAM. Aligning the same dataset using the Java bindings takes approximately 55 minutes.
  • The GATK requires that its input BAMs be sorted and indexed. Before using the Align or AlignmentValidation walker, you must sort and index your unmapped BAM file. Note that this is a limitation of the GATK, not the aligner itself. Using the alignment support files outside of the GATK will eliminate this requirement.

Example: analysis of alignments with the BWA bindings

In order to validate that the Java bindings were computing the same number of reads as BWA/C standalone, we modified the BWA source to gather the number of equally scoring alignments and the frequency of the number of equally scoring alignments. We then implemented the same using a walker written in the GATK. We computed this distribution over a set of 36bp human reads and found the distributions to be identical.

The relevant parts of the walker follow.

public class CountBestAlignmentsWalker extends ReadWalker<Integer,Integer> {
    /**
     * The supporting BWT index generated using BWT.
     */
    @Argument(fullName="BWTPrefix",shortName="BWT",doc="Index files generated by bwa index -d bwtsw",required=false)
    String prefix = null;

    /**
     * The actual aligner.
     */
    private Aligner aligner = null;

    private SortedMap<Integer,Integer> alignmentFrequencies = new TreeMap<Integer,Integer>();

    /**
     * Create an aligner object.  The aligner object will load and hold the BWT until close() is called.
     */
    @Override
    public void initialize() {
        BWTFiles bwtFiles = new BWTFiles(prefix);
        BWAConfiguration configuration = new BWAConfiguration();
        aligner = new BWACAligner(bwtFiles,configuration);
    }

    /**
     * Aligns a read to the given reference.
     * @param ref Reference over the read.  Read will most likely be unmapped, so ref will be null.
     * @param read Read to align.
     * @return Number of alignments found for this read.
     */
    @Override
    public Integer map(char[] ref, SAMRecord read) {
        Iterator<Alignment[]> alignmentIterator = aligner.getAllAlignments(read.getReadBases()).iterator();
        if(alignmentIterator.hasNext()) {
            int numAlignments = alignmentIterator.next().length;
            if(alignmentFrequencies.containsKey(numAlignments))
                alignmentFrequencies.put(numAlignments,alignmentFrequencies.get(numAlignments)+1);
            else
                alignmentFrequencies.put(numAlignments,1);
        }
        return 1;
    }    

    /**
     * Initial value for reduce.  In this case, validated reads will be counted.
     * @return 0, indicating no reads yet validated.
     */
    @Override
    public Integer reduceInit() { return 0; }

    /**
     * Calculates the number of reads processed.
     * @param value Number of reads processed by this map.
     * @param sum Number of reads processed before this map.
     * @return Number of reads processed up to and including this map.
     */
    @Override
    public Integer reduce(Integer value, Integer sum) {
        return value + sum;
    }

    /**
     * Cleanup.
     * @param result Number of reads processed.
     */
    @Override
    public void onTraversalDone(Integer result) {
        aligner.close();
        for(Map.Entry<Integer,Integer> alignmentFrequency: alignmentFrequencies.entrySet())
            out.printf("%d\t%d%n", alignmentFrequency.getKey(), alignmentFrequency.getValue());
        super.onTraversalDone(result);
    }
}

This walker can be run within the svn version of the GATK using -T CountBestAlignments.

The resulting placement count frequency is shown in the graph below. The number of placements clearly follows an exponential.

Bwa dist.png

Validation methods

Two major techniques were used to validate the Java bindings against the current BWA implementation.

  • Fastq files from E coli and from NA12878 chr20 were aligned using BWA standalone with BWA's default settings. The aligned SAM files were sorted, indexed, and fed into the alignment validation walker. The alignment validation walker verified that one of the top scoring matches from the BWA bindings matched the alignment produced by BWA standalone.
  • Fastq files from E coli and from NA12878 chr20 were aligned using the GATK Align walker, then fed back into the GATK's alignment validation walker.
  • The distribution of the alignment frequency was compared between BWA standalone and the Java bindings and was found to be identical.

As an ongoing validation strategy, we will use the GATK integration test suite to align a small unmapped BAM file with human data. The contents of the unmapped BAM file will be aligned and written to disk. The md5 of the resulting file will be calculated and compared to a known good md5.

Unsupported: using the BWA/C bindings from within Matlab

Some users are attempting to use the BWA/C bindings from within Matlab. To run the GATK within Matlab, you'll need to add libbwa.so to your library path through the librarypath.txt file. The librarypath.txt file normally lives in $matlabroot/toolbox/local. Within the Broad Institute, the $matlabroot/toolbox/local/librarypath.txt file is shared; therefore, you'll have to create a librarypath.txt file in your working directory from which you execute matlab.

##
## FILE: librarypath.txt
##
## Entries:
##    o path_to_jnifile
##    o [alpha,glnx86,sol2,unix,win32,mac]=path_to_jnifile
##    o $matlabroot/path_to_jnifile
##    o $jre_home/path_to_jnifile
##
$matlabroot/bin/$arch
/humgen/gsa-scr1/GATK_Data/bwa/stable

Once you've edited the library path, you can verify that Matlab has picked up your modified file by running the following command:

>> java.lang.System.getProperty('java.library.path')
     
ans =
/broad/tools/apps/matlab2009b/bin/glnxa64:/humgen/gsa-scr1/GATK_Data/bwa/stable

Once the location of libbwa.so has been added to the library path, you can use the BWACAligner just as you would any other Java class in Matlab:

>> javaclasspath({'/humgen/gsa-scr1/hanna/src/Sting/dist/packages/Aligner/Aligner.jar'})
>> import org.broadinstitute.sting.alignment.bwa.BWTFiles
>> import org.broadinstitute.sting.alignment.bwa.BWAConfiguration
>> import org.broadinstitute.sting.alignment.bwa.c.BWACAligner
>> x = BWACAligner(BWTFiles('/humgen/gsa-scr1/GATK_Data/bwa/Homo_sapiens_assembly18.fasta'),BWAConfiguration())
>> y=x.getAllAlignments(uint8('CCAATAACCAAGGCTGTTAGGTATTTTATCAGCAATGTGGGATAAGCAC'));

We don't have the resources to directly support using the BWA/C bindings from within Matlab, but if you report problems to us, we will try to address them.

No posts found with the requested search criteria.
No posts found with the requested search criteria.