The ComputeGenomeMask utility determines the alignability of each base in the reference genome.
Mask files are generated based on a fixed read length L. A base is considered alignable if a window of length L centered on the base is unique within the reference sequence.
The ComputeGenomeMask utility works by dividing the input genome into sequences of length L for every base position and then aligning these sequences against the reference genome (using BWA) and looking for unique matches.
The implementation of genome masking in Genome STRiP is likely to change in the future.
-R <fasta-file> : Reference sequence. : An indexed fasta file containing
the reference sequence to mask. : The fasta file must have been indexed using
BWA in preparation for BWA alignment. The fasta file must also be indexed
with 'samtools faidx' or the equivalent.
-readLength <length> : The size of the window to use for determining
alignability.
-sequence <sequenceName> : The name of the sequence to process (default is
to process the entire genome). This can be used to parallelize the
computation by chromosome.
-O <mask-file> : Output genome mask file (fasta format). Default is to
write to stdout. The mask file contains '0' at alignable positions and '1'
at non-alignable positions.ComputeGenomeMask is part of Genome STRiP, which is a third-party GATK library.
An example invocation is shown below:
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
org.broadinstitute.sv.apps.ComputeGenomeMask \
-R Homo_sapiens_assembly18.fasta \
-O Homo_sapiens_assembly18.mask.chr1.36.fasta \
-readLength 36 \
-sequence chr1
To generate a mask for the entire genome, it is generally preferably to compute the mask for each chromosome separately and then concatenate the output files in the correct order.
The following script provides an example of generating the appropiate indexes and running ComputeGenomeMask in parallel on each chromosome. This is an example and not a general purpose script; it will likely need to be modified for your environment. After each chromosome has been run, the resulting files must be concatenated together in the same order as the reference sequence to create the final mask.
#!/bin/bash
outdir=example readLength=75 reference=Homo_sapiens_hg19.fasta
export SV_DIR=/humgen/cnp04/bobh/svtoolkit/stable
#These executables must be on your path.
which java > /dev/null || exit 1
which bwa > /dev/null || exit 1
#The directory containing libbwa.so must be on your LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar"
mkdir -p ${outdir}/work
localReference=${outdir}/work/`echo ${reference} | awk -F / '{ print $NF }'`
if [ ! -e ${localReference} ]; then ln ${reference} ${localReference} || exit 1 fi
java -cp ${classpath} -Xmx4g \
org.broadinstitute.sv.apps.IndexFastaFile \
-I ${localReference} \
-O ${localReference}.fai \
|| exit 1
bwa index -a bwtsw ${localReference} || exit 1
chroms=`cat ${localReference}.fai | cut -f 1`
for chr in ${chroms}; do
bsub -o ${outdir}/work/svmask_${chr}.log \
-R "rusage[mem=5000]" \
java -cp ${classpath} -Xmx4g \
org.broadinstitute.sv.apps.ComputeGenomeMask \
-R ${localReference} \
-O ${outdir}/work/svmask_${chr}.fasta \
-readLength ${readLength} \
-sequence ${chr} \
|| exit 1
done