New machine learning-based single-cell search engine makes cell annotation faster, more efficient
In a Q&A, machine learning expert Mehrtash Babadi introduces Cell Annotation Service, a search engine for single-cell data that he and his group have developed for biologists.
One of the first steps for researchers in studying and analyzing single cells is to determine the cells’ identity: what type and subtype of cells are these, and how similar to or different from previously analyzed cells are they? Scientists then annotate the cells with this information, a process that can take days or even weeks, depending on the number of cells being labeled, and requires labor-intensive literature and database searches.
To speed up the annotation step, the Broad Institute’s Data Sciences Platform (DSP) has developed a new search engine that automates much of this process by using machine learning to search data on more than 50 million annotated single cells. The tool, Cell Annotation Service (CAS), promises to reduce cell annotation time from many hours to just one, and was recently released in beta mode for scientists to use.
To learn more about CAS, we spoke with Mehrtash Babadi, an institute scientist and director of computational methods in DSP. Babadi leads the group that built the new tool.
How does CAS work?
CAS uses some of the same techniques behind reverse image search, which uses a search engine to find other images similar to the image you want to identify. We wanted to build a tool like that for cell biology. So we took lots of reference single-cell RNA sequencing data from atlases and used our scalable machine learning algorithms to embed all of the gene expression data on these cells into compact vector representations — you can think of these as a signature for each cell.
When you have a new cell you’re interested in studying, you can use CAS to compare and match your new cell with all these reference cells based on their signatures, and nominate cells that are similar to yours. It’s basically a search engine. You give it a cell, and it shows you similar cells. And when you give it a single-cell dataset, it generates annotations and labels for you by doing this search and carrying over the labels from similar cells to your cells.
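The search-and-label-transfer idea described above can be illustrated with a toy sketch. This is not the CAS implementation: the three-number signatures, the reference labels, the cosine similarity measure, the majority vote, and the function names are all assumptions made for illustration. The real system uses learned embeddings and a far larger reference set.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def annotate(query, reference, k=3):
    """Label a query cell by majority vote over its k most similar reference cells.

    `reference` is a list of (signature, label) pairs. Returns the winning
    label and the fraction of the k neighbors that carried it.
    """
    if not reference:
        raise ValueError("reference must contain at least one labeled cell")
    # Rank reference cells from most to least similar to the query signature.
    ranked = sorted(
        reference,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    # Carry the neighbors' labels over to the query cell.
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)


# Hypothetical reference signatures; real signatures would come from a trained model.
REFERENCE = [
    ([0.9, 0.1, 0.0], "T cell"),
    ([0.8, 0.2, 0.1], "T cell"),
    ([0.1, 0.9, 0.1], "B cell"),
    ([0.0, 0.8, 0.3], "B cell"),
    ([0.1, 0.1, 0.9], "neuron"),
]

label, support = annotate([0.85, 0.15, 0.05], REFERENCE, k=3)
print(label, round(support, 2))
```

In this toy example, two of the three nearest reference cells are T cells, so the query cell inherits that label. The vote fraction gives a rough sense of how strongly the neighbors agree.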
How did you build the search engine?
Several components of CAS were initially funded by the NIH through the Center for Human Brain Variation at the Broad Institute, where I serve as a co-investigator. We developed the Cellarium AI platform, which powers CAS, to support researchers at the center analyzing massive datasets generated from studying hundreds of human brains, spanning multiple brain regions and tens of thousands of cells per region. Around 2022, we were in discussions with 10x Genomics about potential collaborative research projects. During these conversations, we realized that the platform could be applied beyond its initial scope. CAS emerged as one of these applications, with additional funding provided by 10x Genomics.
As the first step, we built a software platform that could store vast amounts of single-cell data, query those data, and then use them to train large machine learning models and generate these embeddings, or signatures. We trained our models on close to 87 million cells from nearly 1,400 published studies: all of the cells in the CZ CELLxGENE repository, which has been built and curated by the Chan Zuckerberg Initiative. CZ CELLxGENE made sure these datasets were harmonized at the level of the metadata attached to the cells, which made the datasets really useful for machine learning.
Can you give some examples of how biologists can use CAS and what they can learn from it?
One application is determining cell type. Let’s say you have a cell and you know its gene expression profile. You want to know: what is the crude type of the cell? Is it a T cell? If it’s a T cell, is it a CD8+ T cell? If it is, is it a naive, thymus-derived CD8+ T cell? Just by entering the gene expression profile of your new cell, you can narrow down the possibilities of what cell type you’re dealing with.
Another application is to identify whether the cell state you’re seeing is something typically encountered in tissues from healthy donors or in tissues from people with a particular disease. You can also ask: is this cell specific to the tissue you are studying or is it common to multiple tissues?
Let’s say you have a therapeutic that is targeting a specific cell state identified in the context of a certain disease. You may want to know whether the same disease mechanism that is driven by these cells is present in other diseases. If the answer is yes, then you have a good hypothesis to extend the indication for that therapeutic to now include the new diseases.
Is CAS now available to use?
Yes. The CAS model and framework we developed in collaboration with 10x Genomics are now offered to users in 10x Genomics’ Cloud Analysis Automated Cell Annotation pipeline. 10x Genomics is a provider of instruments and assays for single-cell analysis, and the first interaction many users have with their single-cell data is through 10x software. We thought it would be interesting if that initial interaction could be more informative, so that not only would you see technical information about your experiment, such as the number of sequenced cells and their quality, but you’d also be able to learn more about those cells, like what cell types they are and all of the things we’ve talked about here.
To make CAS accessible to a broader audience, including those looking to integrate the service into their own interactive or batch analysis workflows, we’re launching our implementation of CAS as a public beta service. Users can sign up by navigating to the CAS landing page, scrolling to the bottom of the page, and filling out the sign-up form.
During the beta phase, CAS is offered at no cost, with a usage limit of 100,000 individually annotated cells per week and 200,000 individually annotated cells in total. This quota lets us provide the service to a larger and more diverse user base. Currently, the embedding model powering CAS is the same as the one used in the cell annotation pipeline offered by 10x Genomics, though future models and features may evolve separately in alignment with the development roadmaps of each organization.
Overall, how can AI help advance cell biology?
One way is to make information more accessible and more integrated. We’re hoping CAS is taking the first step in that direction, by just making information more findable.
The second way is to integrate all of the cell biology knowledge we have accumulated and keep accumulating into a cohesive fabric. Nowadays the paradigm is to build very large foundation models that have an integrated understanding of all the data that we’ve generated. This would allow us to make good predictions, by fine-tuning these models on cellular perturbation experiments, and this could potentially help unravel mechanisms underlying cell function that have remained hidden so far. This second problem is a different type of problem and it’s much harder. It’s not just about making old data more easily findable, but about being able to synthesize new data based on old data. That is our main vision for the future of all the work we’re doing.