WARNING: This tool is experimental and unsupported and just starting to be developed and used and should be considered a beta version. Feel free to report bugs but we are not supporting the tool
The GSA group has made bindings available for Heng Li's Burrows-Wheeler Aligner (BWA). Our aligner bindings present additional functionality to the user not traditionally available with BWA. BWA standalone is optimized to do fast, low-memory alignments from Fastq to BAM. While our bindings aim to provide support for reasonably fast, reasonably low memory alignment, we add the capacity to do exploratory data analyses. The bindings can provide all alignments for a given read, allowing a user to walk over the alignments and see information not typically provided in the BAM format. Users of the bindings can 'go deep', selectively relaxing alignment parameters one read at a time, looking for the best alignments at a site.
The BWA/C bindings should be thought of as alpha release quality. However, we aim to be particularly responsive to issues in the bindings as they arise. Because of the bindings' alpha state, some functionality is limited; see the Limitations section below for more details on what features are currently supported.
Whenever native code is called from Java, the user must assist Java in finding the proper shared library. Java looks for shared libraries in two places, on the system-wide library search path and through Java properties invoked on the command line. To add libbwa.so to the global library search path, add the following to your .my.bashrc, .my.cshrc, or other startup file:
export LD_LIBRARY_PATH=/humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH
setenv LD_LIBRARY_PATH /humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH
To specify the location of libbwa.so directly on the command-line, use the java.library.path system property as follows:
java -Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta
We provide internally accessible versions of both the BWA shared library and precomputed BWA indices for two commonly used human references at the Broad (Homo_sapiens_assembly18.fasta and human_b36_both.fasta). These files live in the following directory:
/humgen/gsa-scr1/GATK_Data/bwa/stable
Two steps are required in preparing to use the aligner: building the shared library and using BWA/C to generate an index of the reference sequence.
The Java bindings to the aligner are available through the Sting repository. A precompiled version of the bindings are available for Linux; these bindings are available in c/bwa/libbwa.so.1. To build the aligner from source:
sh autogen.sh ./configure make
To build a reference sequence, use the BWA C executable directly:
bwa index -a bwtsw <your reference sequence>.fasta
Two walkers are provided for end users of the GATK. The first of the stock walkers is Align, which can align an unmapped BAM file or realign a mapped BAM file.
java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T Align \
-I NA12878_Pilot1_20.unmapped.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta \
-U \
-ob human.unsorted.bam
Most of the available parameters here are standard GATK. -T specifies that the alignment analysis should be used; -I specifies the unmapped BAM file to align, and the -R specifies the reference to which to align. By default, this walker assumes that the bwa index support files will live alongside the reference. If these files are stored elsewhere, the optional -BWT argument can be used to specify their location. By defaults, alignments will be emitted to the console in SAM format. Alignments can be spooled to disk in SAM format using the -o option or spooled to disk in BAM format using the -ob option.
The other stock walker is AlignmentValidation, which computes all possible alignments based on the BWA default configuration settings and makes sure at least one of the top alignments matches the alignment stored in the read.
java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta
Options for the AlignmentValidation walker are identical to the Alignment walker, except the AlignmentValidation walker's only output is a exception if validation fails.
Another sample walker of limited scope, CountBestAlignmentsWalker, is available for review; it is discussed in the example section below.
BWA/C can be created on-the-fly using the org.broadinstitute.sting.alignment.bwa.c.BWACAligner constructor. The bindings have two sets of interfaces: an interface which returns all possible alignments and an interface which randomly selects an alignment from a list of the top scoring alignments as selected by BWA.
To iterate through all functions, use the following method:
/**
* Get a iterator of alignments, batched by mapping quality.
* @param bases List of bases.
* @return Iterator to alignments.
*/
public Iterable<Alignment[]> getAllAlignments(final byte[] bases);
The call will return an Iterable which batches alignments by score. Each call to next() on the provided iterator will return all Alignments of a given score, ordered in best to worst. For example, given a read sequence with at least one match on the genome, the first call to next() will supply all exact matches, and subsequent calls to next() will give alignments judged to be inferior by BWA (alignments containing mismatches, gap opens, or gap extensions).
Alignments can be transformed to reads using the following static method in org.broadinstitute.sting.alignment.Alignment:
/**
* Creates a read directly from an alignment.
* @param alignment The alignment to convert to a read.
* @param unmappedRead Source of the unmapped read. Should have bases, quality scores, and flags.
* @param newSAMHeader The new SAM header to use in creating this read. Can be null, but if so, the sequence
* dictionary in the
* @return A mapped alignment.
*/
public static SAMRecord convertToRead(Alignment alignment, SAMRecord unmappedRead, SAMFileHeader newSAMHeader);
A convenience method is available which allows the user to get SAMRecords directly from the aligner.
/**
* Get a iterator of aligned reads, batched by mapping quality.
* @param read Read to align.
* @param newHeader Optional new header to use when aligning the read. If present, it must be null.
* @return Iterator to alignments.
*/
public Iterable<SAMRecord[]> alignAll(final SAMRecord read, final SAMFileHeader newHeader);
To return a single read randomly selected by the bindings, use one of the following methods:
/**
* Allow the aligner to choose one alignment randomly from the pile of best alignments.
* @param bases Bases to align.
* @return An align
*/
public Alignment getBestAlignment(final byte[] bases);
/**
* Align the read to the reference.
* @param read Read to align.
* @param header Optional header to drop in place.
* @return A list of the alignments.
*/
public SAMRecord align(final SAMRecord read, final SAMFileHeader header);
The org.broadinstitute.sting.alignment.bwa.BWAConfiguration argument allows the user to specify parameters normally specified to 'bwt aln'. Available parameters are:
Settings must be supplied to the constructor; leaving any BWAConfiguration field unset means that BWA should use its default value for that argument. Configuration settings can be updated at any time using the BWACAligner updateConfiguration method.
public void updateConfiguration(BWAConfiguration configuration);
The BWA/C bindings were written with running outside of the GATK in mind, but this workflow has never been tested. If you would like to run the bindings outside of the GATK, you will need:
To build the packaged version of the aligner, run the following command
cp $STING_HOME/lib/bcel-*.jar ~/.ant/lib ant package -Dexecutable=Aligner
This command will extract all classes required to run the aligner and place them in $STING_HOME/dist/packages/Aligner/Aligner.jar. You can then specify this one jar in your project's dependencies.
The BWA/C bindings are currently in an alpha state, but they are extensively supported. Because of the bindings' alpha state, some functionality is limited. The limitations of these bindings include:
In order to validate that the Java bindings were computing the same number of reads as BWA/C standalone, we modified the BWA source to gather the number of equally scoring alignments and the frequency of the number of equally scoring alignments. We then implemented the same using a walker written in the GATK. We computed this distribution over a set of 36bp human reads and found the distributions to be identical.
The relevant parts of the walker follow.
public class CountBestAlignmentsWalker extends ReadWalker<Integer,Integer> {
/**
* The supporting BWT index generated using BWT.
*/
@Argument(fullName="BWTPrefix",shortName="BWT",doc="Index files generated by bwa index -d bwtsw",required=false)
String prefix = null;
/**
* The actual aligner.
*/
private Aligner aligner = null;
private SortedMap<Integer,Integer> alignmentFrequencies = new TreeMap<Integer,Integer>();
/**
* Create an aligner object. The aligner object will load and hold the BWT until close() is called.
*/
@Override
public void initialize() {
BWTFiles bwtFiles = new BWTFiles(prefix);
BWAConfiguration configuration = new BWAConfiguration();
aligner = new BWACAligner(bwtFiles,configuration);
}
/**
* Aligns a read to the given reference.
* @param ref Reference over the read. Read will most likely be unmapped, so ref will be null.
* @param read Read to align.
* @return Number of alignments found for this read.
*/
@Override
public Integer map(char[] ref, SAMRecord read) {
Iterator<Alignment[]> alignmentIterator = aligner.getAllAlignments(read.getReadBases()).iterator();
if(alignmentIterator.hasNext()) {
int numAlignments = alignmentIterator.next().length;
if(alignmentFrequencies.containsKey(numAlignments))
alignmentFrequencies.put(numAlignments,alignmentFrequencies.get(numAlignments)+1);
else
alignmentFrequencies.put(numAlignments,1);
}
return 1;
}
/**
* Initial value for reduce. In this case, validated reads will be counted.
* @return 0, indicating no reads yet validated.
*/
@Override
public Integer reduceInit() { return 0; }
/**
* Calculates the number of reads processed.
* @param value Number of reads processed by this map.
* @param sum Number of reads processed before this map.
* @return Number of reads processed up to and including this map.
*/
@Override
public Integer reduce(Integer value, Integer sum) {
return value + sum;
}
/**
* Cleanup.
* @param result Number of reads processed.
*/
@Override
public void onTraversalDone(Integer result) {
aligner.close();
for(Map.Entry<Integer,Integer> alignmentFrequency: alignmentFrequencies.entrySet())
out.printf("%d\t%d%n", alignmentFrequency.getKey(), alignmentFrequency.getValue());
super.onTraversalDone(result);
}
}
This walker can be run within the svn version of the GATK using -T CountBestAlignments.
The resulting placement count frequency is shown in the graph below. The number of placements clearly follows an exponential.
Two major techniques were used to validate the Java bindings against the current BWA implementation.
As an ongoing validation strategy, we will use the GATK integration test suite to align a small unmapped BAM file with human data. The contents of the unmapped BAM file will be aligned and written to disk. The md5 of the resulting file will be calculated and compared to a known good md5.
Some users are attempting to use the BWA/C bindings from within Matlab. To run the GATK within Matlab, you'll need to add libbwa.so to your library path through the librarypath.txt file. The librarypath.txt file normally lives in $matlabroot/toolbox/local. Within the Broad Institute, the $matlabroot/toolbox/local/librarypath.txt file is shared; therefore, you'll have to create a librarypath.txt file in your working directory from which you execute matlab.
## ## FILE: librarypath.txt ## ## Entries: ## o path_to_jnifile ## o [alpha,glnx86,sol2,unix,win32,mac]=path_to_jnifile ## o $matlabroot/path_to_jnifile ## o $jre_home/path_to_jnifile ## $matlabroot/bin/$arch /humgen/gsa-scr1/GATK_Data/bwa/stable
Once you've edited the library path, you can verify that Matlab has picked up your modified file by running the following command:
>> java.lang.System.getProperty('java.library.path')
ans =
/broad/tools/apps/matlab2009b/bin/glnxa64:/humgen/gsa-scr1/GATK_Data/bwa/stable
Once the location of libbwa.so has been added to the library path, you can use the BWACAligner just as you would any other Java class in Matlab:
>> javaclasspath({'/humgen/gsa-scr1/hanna/src/Sting/dist/packages/Aligner/Aligner.jar'})
>> import org.broadinstitute.sting.alignment.bwa.BWTFiles
>> import org.broadinstitute.sting.alignment.bwa.BWAConfiguration
>> import org.broadinstitute.sting.alignment.bwa.c.BWACAligner
>> x = BWACAligner(BWTFiles('/humgen/gsa-scr1/GATK_Data/bwa/Homo_sapiens_assembly18.fasta'),BWAConfiguration())
>> y=x.getAllAlignments(uint8('CCAATAACCAAGGCTGTTAGGTATTTTATCAGCAATGTGGGATAAGCAC'));
We don't have the resources to directly support using the BWA/C bindings from within Matlab, but if you report problems to us, we will try to address them.
The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.
java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta
This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!
The tool, though once in the GATK, is now part of the Picard package.
It is useful for fixing problems such as not having read groups in a bam file.
java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo
Note that this tool is now part of the Picard package: http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups
This tool can fix BAM files without read group information:
# throws an error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I testdata/exampleNORG.bam -T UnifiedGenotyper
# fix the read groups
java -jar picard/AddOrReplaceReadGroups.jar I= testdata/exampleNORG.bam O= exampleNewRG.bam SORT_ORDER=coordinate RGID=foo RGLB=bar RGPL=illumina RGSM=DePristo CREATE_INDEX=True
# runs without error
java -jar dist/GenomeAnalysisTK.jar -R testdata/exampleFASTA.fasta -I exampleNewRG.bam -T UnifiedGenotyper
BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can
The GATK provides and experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.
The basic workflow for this interface is as follows:
After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.
For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:
marker alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251 T C 10.00 1.26 0.00 9.77 2.45 0.00
20:60321 G T 10.00 5.01 0.01 10.00 0.31 0.00
20:60467 G C 9.55 2.40 0.00 9.55 1.20 0.00
Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).
IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.
We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example
java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun
Extra BEAGLE arguments can be added as required.
Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.
BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:
# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like
Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.
The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.
The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).
The resulting VCF can now be used for further downstream analysis.
Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.
java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
.
.
.
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX
Contents |
This script converts a VCF file from one reference build to another. It runs 3 modules within our toolkit that are necessary for lifting over a VCF.
1. LiftoverVariants walker
2. sortByRef.pl to sort the lifted-over file
3. Filter out records whose ref field no longer matches the new reference
The liftOverVCF.pl script is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.
./liftOverVCF.pl -vcf calls.b36.vcf \ -chain b36ToHg19.broad.over.chain \ -out calls.hg19.vcf \ -gatk /humgen/gsa-scr1/ebanks/Sting_dev -newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19 -oldRef /humgen/1kg/reference/human_b36_both -tmp /broad/shptmp [defaults to /tmp]
Running the script with no arguments will show the usage:
Usage: liftOverVCF.pl
-vcf <input vcf>
-gatk <path to gatk trunk>
-chain <chain file>
-newRef <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>
-oldRef <path to old reference prefix; we will need oldRef.fasta>
-out <output vcf>
-tmp <temp file location; defaults to /tmp>
Chain files from b36/hg18 to hg19 are located here within the Broad:
/humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/
External users can get them off our ftp site:
location: ftp.broadinstitute.org username: gsapubftp-anonymous path: Liftover_Chain_Files
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
-geneList /path/to/gene/list.txt
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,
587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0,
587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0,
587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0,
589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0,
589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0,
589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at
/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt
If you supply the -geneList argument, DepthOfCoverage v3.0 will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1
SORT1 594710 238.27 594710 238.27 165 245 330
NOTCH2 3011542 357.84 3011542 357.84 222 399 >500
LMNA 563183 186.73 563183 186.73 116 187 262
NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.