Reducing the size of the BAM files generated from DNA sequencing experiments

Mentors: Mauricio Carneiro

The BAM file is the de facto standard file format for storing DNA sequence data and whole-genome sequence alignments. A traditional “full” BAM file contains every recorded DNA sequence (i.e., every “read”) along with its quality scores. With the advent of next-generation sequencing technologies, the amount of DNA sequencing data has grown faster than disk storage capacity. An average human genome study of 60 people yields a 300-gigabyte BAM file. The goal of the “ReduceBAM” project is to reduce the file size of DNA sequence data without affecting the accuracy of the analyses performed on those data.

Roger and his mentor developed a new compression algorithm that eliminates data from the file wherever a portion of the DNA sequence data is redundant and thus unnecessary. The proposed algorithm reduces file size by a factor of 14 to 28 without losing accuracy. The smaller files transfer faster, take up less hard drive space, and can also translate into faster analysis.

Roger, a senior at Medford High School, developed an algorithm to make large sequencing data files 14-28 times smaller, and therefore easier to store and analyze.