Juan Caicedo may have the next wildly ambitious use of artificial intelligence: bringing the power of foundation models into the world of cellular imaging. Foundation models are powering the next generation of machine learning applications, such as ChatGPT for natural language processing.
Caicedo, a Morgridge Institute biomedical imaging investigator and UW–Madison professor of biostatistics and medical informatics, is working on an image processing platform that could analyze cellular data from any experiment — regardless of the imaging modality, resolution or type of experiment — to identify common features about the phenotypic structure and morphology of cells.
Caicedo calls the concept “universal morphology,” and it has the potential to turbocharge our understanding of cell biology, drug discovery and disease biomarkers.
“It’s kind of difficult to have image analysis models that can be reused from experiment to experiment, but that is our goal,” says Caicedo. “We want to create a machine learning model that can process any type of microscopy image, so biologists don’t have to retrain machine learning models for every experiment. Hopefully, all they have to do is to download our model in order to solve bioimage analysis tasks.”
In order to develop the project, Caicedo needs cellular image data — lots of it. Unlike natural language chatbots, which are being trained with the almost infinite amount of text available online, Caicedo will need to curate some of the larger public cell image datasets that have been developed from many individual experiments.
Caicedo is working with the Center for High-Throughput Computing (CHTC) to make some of those publicly available datasets accessible to the UW–Madison community. They are currently targeting datasets from around the world, which they estimate will total hundreds of terabytes of data.
Of particular interest to the imaging challenge is the experience of multilingual chatbots: AI tools that can learn across many different languages and identify patterns common to all of them. Caicedo says that is a very close parallel to trying to analyze cellular data en masse, because each dataset is almost like its own language, captured at widely divergent resolutions and focused on different cell functions.
“One experiment pulled together almost 100 languages — everything from Japanese, Hebrew, Spanish and English — all mixed together,” Caicedo says. “In principle, they don’t have much in common. But the machine learning model was able to find the structures or patterns that are recurrent in every language, because even though they have major differences, they also have a lot of things in common.”
“And we believe that the same can happen with imaging experiments, where they may not seem compatible in the beginning,” he adds. “But we think that putting all of them together is going to help us identify patterns that are common across imaging experiments. And that’s going to help us have a more robust and accurate type of model.”
Caicedo says the platform will primarily be valuable in basic research and preclinical settings, where quantifying complex phenotypes is crucial to advancing the development of treatments or identifying the functional impact of interventions.
“In many applications in cellular biology, what we need to know is whether the phenotype is in one or another category. Is it a healthy phenotype? Or is it a disease phenotype? And getting that right determines the success of the experiments. We think that these large-scale systems will help us do that more accurately and easily,” he says.
Caicedo received NSF funding in 2022 to develop the project together with colleagues at Boston University. He has also partnered with peers at Columbia University, Meta AI, and the University of Helsinki to tackle this challenge. The project has produced a few papers describing its progress, including two presented in December 2023 at the largest machine learning conference worldwide.
Caicedo welcomes contributions from scientists who have diverse cell image databases and would be interested in sharing the data for training purposes; they are invited to contact him. His goal is to complete preliminary findings on the efficacy of the approach in 2024.