MIA Talks

Low dimensional embeddings of words and documents (and how they might apply to single-cell data)

March 31, 2021
Tutte Institute for Mathematics and Computing

Over the last decade the field of Natural Language Processing (NLP) has been overtaken by neural networks and deep learning. The latest models and algorithms, from word2vec to Google’s Universal Sentence Encoder and BERT, can perform seemingly magical feats and provide powerful tools to understand and analyze text documents.

This talk will pick apart word and document embedding techniques from NLP, removing the neural networks and instead working with linear algebra and dimension reduction on large sparse matrices of counts, which closely resemble a large amount of single-cell data. Such an approach can achieve results on par with neural network approaches, but is both simpler to understand and generalizable to much more diverse domains than NLP, including, hopefully, single-cell research. In particular, all we really need to apply these kinds of techniques are large sparse matrices of counts and some notion of locality or co-occurrence among the feature columns of that matrix.
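
To make the idea concrete, here is a minimal sketch of one well-known count-based recipe: build a sparse word co-occurrence matrix, reweight it with positive pointwise mutual information (PPMI), and reduce it with a truncated SVD. This is not necessarily the method presented in the talk. The toy corpus, the window size of 2, the embedding dimension of 2, and all variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased a dog",
    "a dog chased a cat",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
index = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Sparse co-occurrence counts within a symmetric window of +/- 2 tokens.
window = 2
rows, cols = [], []
for doc in tokens:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                rows.append(index[w])
                cols.append(index[doc[j]])
# Duplicate (row, col) entries are summed when converting to CSR.
counts = sparse.coo_matrix(
    (np.ones(len(rows)), (rows, cols)), shape=(n, n)
).tocsr()

# PPMI reweighting, computed only on the nonzero entries.
total = counts.sum()
row_sum = np.asarray(counts.sum(axis=1)).ravel()
col_sum = np.asarray(counts.sum(axis=0)).ravel()
coo = counts.tocoo()
pmi = np.log(coo.data * total / (row_sum[coo.row] * col_sum[coo.col]))
keep = pmi > 0
ppmi = sparse.csr_matrix(
    (pmi[keep], (coo.row[keep], coo.col[keep])), shape=(n, n)
)

# Truncated SVD gives low-dimensional word vectors.
u, s, vt = svds(ppmi, k=2)
word_vectors = u * np.sqrt(s)

# Document vectors: count-weighted average of word vectors.
# The document-by-word matrix plays the role a cell-by-gene
# count matrix might play for single-cell data.
d_rows = [d for d, doc in enumerate(tokens) for _ in doc]
d_cols = [index[w] for doc in tokens for w in doc]
doc_word = sparse.coo_matrix(
    (np.ones(len(d_rows)), (d_rows, d_cols)), shape=(len(docs), n)
).tocsr()
doc_len = np.asarray(doc_word.sum(axis=1))
doc_vectors = (doc_word @ word_vectors) / doc_len
```

Every step operates on sparse matrices until the final dense embedding, so the same pattern should in principle carry over to any large count matrix for which a sensible notion of feature co-occurrence can be defined.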

The goal of this talk is to open up some of the ideas that have been so effective in NLP, make them usable and accessible to a wider audience working with more diverse types of data, and start a conversation about what is required to make such techniques useful for the single-cell community.