A dataset for powering multi-channel foundation models of cell biology
CHAMMI-75 is a large, highly diverse, curated dataset of multi-channel microscopy images designed to advance foundation models for cellular biology. With 2.8 million images from 75 studies, CHAMMI-75 amasses in a single resource a level of technical and biological variation once thought unreachable but now widely appreciated as the key to training foundation models of cell biology. Such models will transform the way we ask and answer questions about disease.
Our vision is that foundation models should be able to understand cell morphology at all scales, in all cell types, and independent of microscopy technique. In a systematic evaluation using six benchmarks, we demonstrate that CHAMMI-75 pre-training yields an AI vision model that can generalize to novel channel combinations and imaging modalities, outperforming other state-of-the-art models in a wide range of tasks, from localizing proteins to classifying red blood cells to screening for disease.
Initial dataset of 26.7M images produced by a wide array of study types and hosted on 18 different platforms — 97% of contributing datasets have Creative Commons licenses, 65% CC BY 4.0
Training on diverse microscopy types — brightfield, fluorescence, confocal, cryo-EM, and more — proves a critical driver of model performance, more important than data volume alone
Systematic sampling of metadata fields to curate the data; annotation of 1.8B single cells to maximize meaningful image information; normalization of all images to 8-bit PNG with lossless compression for ease of use
22 metadata fields organized into 6 groups for comprehensive bioimaging context — including 223 cell lines, mostly human (biology), and 14 imaging modalities (microscopy)
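The normalization step above converts heterogeneous raw microscopy data (often 12- or 16-bit) into uniform 8-bit PNGs. A minimal sketch of that kind of conversion is shown below; the percentile-clipping choice and parameter values are illustrative assumptions, not the exact procedure used to build CHAMMI-75.

```python
import numpy as np

def to_uint8(img: np.ndarray, p_low: float = 0.1, p_high: float = 99.9) -> np.ndarray:
    """Rescale a single-channel intensity image to 8-bit.

    Clips to robust percentiles before scaling so a few hot pixels do
    not crush the dynamic range (the percentiles here are illustrative).
    """
    lo, hi = np.percentile(img, [p_low, p_high])
    if hi <= lo:  # flat image: nothing to scale
        return np.zeros(img.shape, dtype=np.uint8)
    scaled = (img.astype(np.float32) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255.0 + 0.5).astype(np.uint8)

# Example: a synthetic 16-bit fluorescence channel
raw = np.random.default_rng(0).gamma(2.0, 500.0, size=(64, 64)).astype(np.uint16)
img8 = to_uint8(raw)
# The result can then be written losslessly, e.g. with Pillow:
#   Image.fromarray(img8).save("channel.png")  # PNG compression is lossless
```

PNG is a natural target here because its compression is lossless, so the 8-bit values survive storage unchanged.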
The emergence of multi-channel AI vision models heralds a new future for deep learning-driven life sciences. But existing initiatives have been constrained by scale. Technical differences and inconsistent metadata across the millions of publicly available multi-channel microscopy images make collecting and curating a single resource a formidable undertaking.
We set out to create a representative visual sample of cell biology as seen under the microscope over three phases — data acquisition, curation, and annotation — expanding on our earlier effort to test channel-adaptive models against fixed-channel baselines. Most AI vision models assume a fixed number of input channels, akin to the three channels of the RGB colorspace. CHAMMI — channel-adaptive models in microscopy imaging — instead treats the multiple, varied channels inherent to microscopy images, each encoding a different meaningful signal, as a virtue to be leveraged.
Ultimately, breaking the fixed-channel bottleneck opens AI vision models to a wealth of technical and biological diversity that drives remarkably robust model performance. We systematically benchmarked our model, MorphEm, against existing models pre-trained on other state-of-the-art datasets across a variety of real-world tasks, and it consistently outperformed them.
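To make the channel-adaptive idea concrete, here is a minimal sketch of one common way to handle a variable number of channels: apply a shared per-channel feature map and pool over channels. This is an illustration of the general technique only, not MorphEm's actual architecture, and all names and shapes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Shared per-channel stem: one linear map applied to each channel
# independently (weights shared across channels), followed by mean
# pooling over the channel axis. Because nothing in the computation
# depends on the channel count, the same weights accept 3-, 5-, or
# 7-channel inputs without retraining a new input layer.
W = rng.normal(0.0, 0.1, size=(16, 8 * 8))  # 16 features per 8x8 channel

def embed(image: np.ndarray) -> np.ndarray:
    """image: (C, 8, 8) with any channel count C -> (16,) feature vector."""
    per_channel = image.reshape(image.shape[0], -1) @ W.T  # (C, 16)
    return per_channel.mean(axis=0)                        # channel-invariant pool

f3 = embed(rng.normal(size=(3, 8, 8)))  # e.g. a 3-channel fluorescence image
f5 = embed(rng.normal(size=(5, 8, 8)))  # e.g. a 5-channel Cell Painting image
```

The key property is that both inputs map to the same feature space, so one backbone can consume images from studies with different channel combinations.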
Access the complete dataset, pre-trained models, and training code to advance your research in cellular morphology.
Read the Paper
Download Dataset
View on GitHub
Download Model
The CHAMMI-75 team spans eight institutions and is led by Juan Caicedo at the Morgridge Institute for Research and University of Wisconsin–Madison, where the Center for High Throughput Computing delivers expertise and compute capacity at the requisite scale.
Vidit Agrawal | John Peters | Tyler N. Thompson | Arshad Kazi | Aditya Pillai | Juan Caicedo
Morgridge Institute for Research and University of Wisconsin–Madison
Chau Pham | Bryan A. Plummer
Boston University
Mohammad Vali Sanian | Lassi Paavolainen
Institute for Molecular Medicine Finland and University of Helsinki
Nikita Moshkov
Helmholtz Munich
Jack Freeman | Ron Stewart
Morgridge Institute for Research
Byunguk Kang | Samouil L. Farhi | Ernest Fraenkel
Broad Institute of MIT and Harvard and Massachusetts Institute of Technology