Very large genome sequencing datasets can be a pain to work with. Beyond the problem of storing the large amounts of data involved, processing a lot of data simultaneously (for example, in multisample calling experiments involving large cohorts) is a huge computational challenge. To mitigate this problem, we have developed a novel algorithm that compresses large portions of the read data into consensus reads that retain the information useful for variant calling (such as depth of coverage and base and mapping quality scores) yet have a much smaller computational footprint. The compression process occurs in a single straightforward step that produces a new BAM file containing the reduced data. To be clear, this compression mode is NOT meant to provide a solution for long-term storage; you should always retain a copy of the unreduced data.
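As a sketch of what that single step looks like, a typical GATK command-line invocation of the ReduceReads walker is shown below (the reference and BAM file names here are placeholders; substitute your own):

    # Compress a BAM file into consensus reads with the ReduceReads walker
    # (reference.fasta, original.bam, and reduced.bam are placeholder names)
    java -jar GenomeAnalysisTK.jar \
        -T ReduceReads \
        -R reference.fasta \
        -I original.bam \
        -o reduced.bam

The resulting reduced BAM file can then be passed to downstream GATK tools in place of the original file.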
Using ReduceReads on your BAM files will cut them down to approximately 1/100 of their original size, allowing the GATK to process tens of thousands of samples simultaneously without excessive I/O and processing burdens. Even for single samples, ReduceReads cuts the memory requirements, I/O burden, and CPU costs of downstream tools significantly (by 10x or more).