RNA-Seq is a powerful transcriptomics technology for quantifying the expression of all genes in a sample of interest. Because deeply sequenced, well-characterized reference genomes are available, RNA-Seq can be performed with millions of short reads that are aligned back to a reference. In any RNA-Seq experiment, a varying proportion of the reads does not align to the genome being examined, and these reads are omitted from further study. Reads that fail to “map” back to the genome of interest are especially costly in single-cell analysis: only a small amount of mRNA is collected from a single cell, so any loss of data has a larger influence on downstream analysis.
To explore this problem, the present study analyzed RNA-Seq data from individual mouse immune cells. We observed that the reads that failed to map to the mouse genome were most likely to align to a human or E. coli reference. We also identified a portion of the previously unaligned reads as primer sequences. Furthermore, this study combined several existing RNA-Seq pipelines to quantify the number of reads mapping to the reference genome, the number of reads mapping to the genomes of other species, and the reads attributable to other possible sources of contamination.
Ultimately, this study will produce an open-source software tool for analyzing reads that fail to map, and it will improve our procedures for generating and analyzing RNA-Seq profiles.
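The read accounting described above can be illustrated with a minimal Python sketch. It is not the study's tool. It assumes that each read has already been aligned by some external aligner and arrives as a pair of its sequence and the set of reference names it aligned to. The function names, the category labels, the placeholder primer sequence, and the order of the checks (primary genome first, then primers, then other species) are all assumptions made for illustration.

```python
from collections import Counter

# Placeholder primer sequence (hypothetical); real primers depend on the
# library preparation protocol and are not given in the text.
PRIMERS = ("ACGTACGTACGT",)


def classify_read(sequence, hits, primary="mouse", primers=PRIMERS):
    """Assign one read to a single category.

    sequence: the read's bases as a string.
    hits: set of reference names the read aligned to (assumed to come
          from an upstream aligner; may be empty).
    """
    if primary in hits:
        return primary                      # mapped to the genome of interest
    if any(p in sequence for p in primers):
        return "primer"                     # read contains a known primer
    others = hits - {primary}
    if len(others) == 1:
        return next(iter(others))           # mapped to exactly one other species
    if others:
        return "ambiguous"                  # mapped to several other species
    return "unassigned"                     # no explanation found


def summarize(reads, primary="mouse", primers=PRIMERS):
    """Return {category: (count, fraction)} over an iterable of (sequence, hits)."""
    counts = Counter(
        classify_read(seq, hits, primary, primers) for seq, hits in reads
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: (n, n / total) for cat, n in counts.items()}


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    example = [
        ("TTTTGGGGCCCC", {"mouse"}),
        ("TTACGTACGTACGTAA", set()),
        ("GGGGCCCCAAAA", {"human"}),
        ("CCCCAAAATTTT", {"e_coli"}),
        ("AAAATTTTGGGG", set()),
    ]
    for category, (count, fraction) in sorted(summarize(example).items()):
        print(f"{category}\t{count}\t{fraction:.2f}")
```

Keeping the classification of a single read separate from the tally makes it easy to change the priority of the checks, for example to test for other species before primers, without touching the counting code.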
The community and experience at the Broad Institute are irreplaceable. The Broad Institute is not only a collection of top-notch scientists dedicated to the advancement of scientific discovery, but also a community of mentors and students who value learning. This generates an atmosphere of collaboration and enthusiasm that I believe to be critical to the development of any young scientist.