Compress the read data in order to minimize file sizes, which facilitates massively multisample processing.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -R reference.fa \ -I recal_reads.bam \ -L 20 \ -o reduced_reads.bam
This creates a file called
reduced_reads.bam containing only the sequence information that is essential for calling variants.
Note that ReduceReads is not meant to be run on multiple samples at once. If you plan on merging your sample bam files, you should run ReduceReads on individual samples before doing so.
When running reduce reads, the algorithm will find regions of low variation in the genome and compress them together. To represent this compressed region, we use a synthetic read that carries all the information necessary to downstream tools to perform likelihood calculations over the reduced data.
They are called Synthetic because they are not read by a sequencer, these reads are automatically generated by the GATK and can be extremely long. In a synthetic read, each base will represent the consensus base for that genomic location. Each base will have it's consensus quality score represented in the equivalent offset in the quality score string.
ReduceReads has several filtering parameters for consensus regions. Consensus is created based on base qualities, mapping qualities and other adjustable parameters from the command line. All filters are described in the technical documentation of reduce reads.
The consensus quality score of a consensus base is essentially the mean of all bases that passed all the filters and represent an observation of that base. It is represented in the quality score field of the SAM format.
n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.
Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.
The mapping quality of a synthetic read is a value representative of the mapping qualities of all the reads that contributed to it. This is an average of the root mean square of the mapping quality of all reads that contributed to the bases of the synthetic read. It is represented in the mapping quality score field of the SAM format.
where n is the number of reads and x_i is the mapping quality of each read.
A synthetic read may come with up to two extra tags representing its original alignment information. Due to many filters in ReduceReads, reads are hard-clipped to the are of interest. These hard-clips are always represented in the cigar string with the H element and the length of the clipping in genomic coordinates. Sometimes hard clipping will make it impossible to retrieve what was the original alignment start / end of a read. In those cases, the read will contain extra tags with integer values representing their original alignment start or end.
Here are the two integer tags:
For all other reads, where this can still be obtained through the cigar string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.
the RR tag is a tag that holds the observed depth (after filters) of every base that contributed to a reduce read. That means all bases that passed the mapping and base quality filters, and had the same observation as the one in the reduced read.
The RR tag carries an array of bytes and for increased compression, it works like this: the first number represents the depth of the first base in the reduced read. all subsequent numbers will represent the offset depth from the first base. Therefore, to calculate the depth of base "i" using the RR array, one must use :
RR + RR[i]
but make sure i > 0. Here is the code we use to return the depth of the i'th base:
return (i==0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);
The GATK is 100% compatible with synthetic reads. You can use Reduced BAM files in combination with non-reduced BAM files in any GATK analysis tools and it will work seamlessly.
If you are programming using the GATK framework, the GATKSAMRecord class carries all the necessary functionality to use synthetic reads transparently with methods like:
GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.
The purpose of the ReduceReads algorithm was to enable joint analysis of large cohorts by the UnifiedGenotyper. The new workflow for joint discovery, which involves doing a single-sample pass with the HaplotypeCaller in gVCF mode followed by a joint analysis on multiple sample gVCFs, renders the compression step obsolete.
In addition, based on our most recent analyses, we have come to the conclusion that the quality of variant calls made on BAMs compressed with ReduceReads is inferior to the standard targeted by GATK tools. In comparison, the results obtained with the new workflow are far superior.
For these reasons, we have made the difficult decision to remove the ReduceReads tool from version 3.0 of the toolkit. To be clear, reduced BAMs will NOT be supported in GATK 3.0.
We realize that this may cause some disruption to your existing workflows, and for that we apologize. Please understand that we are driven to provide tools that produce the best possible results. Now that all the data is in, we have found that the best results cannot be achieved with reduced BAMs, so we feel that the best thing to do is to remove this inferior tool from the toolkit, and promote the new tools.
As always we welcome your comments, and we look forward to showing you how the new calling workflow will yield superior results.
GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.
GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.
This has now been fixed in the most recent release.
To check if you are using a buggy version, run the following:
samtools view -H $BAM
This will produce the following output:
@PG ID:GATK ReduceReads VN:XXX
If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version then rerun your processing.
Our most sincere apologies for the inconvenience.