2.8M Multi-Channel Images · 75 Biological Studies · 25 Channel Types · 1.8B Segmented Cells

CHAMMI-75 is a large, highly diverse, curated dataset of multi-channel microscopy images designed to advance foundation models for cellular biology. With 2.8 million images from 75 studies, CHAMMI-75 gathers into a single resource a level of technical and biological variation once thought unreachable, and now widely appreciated as the key to training foundation models of cell biology. Such models will transform the way we ask and answer questions about disease.

Our vision is that foundation models should be able to understand cell morphology at all scales, in all cell types, and independent of microscopy technique. In a systematic evaluation using six benchmarks, we demonstrate that CHAMMI-75 pre-training yields an AI vision model that can generalize to novel channel combinations and imaging modalities, outperforming other state-of-the-art models in a wide range of tasks, from localizing proteins to classifying red blood cells to screening for disease.

Read the paper.

Channel Types
Relative to existing datasets used to train models of cell morphology, CHAMMI-75 combines the highest number of channel types — nuclei, cell bodies, and so on — with a massive number of multi-channel images: 2.8 million. Circle size corresponds to the number of sources. Orange circles represent multi-channel datasets, blue circles fixed-channel datasets, and green circles varied-channel datasets where channel order and information are not preserved. Open circles are public; the closed circle is private.

Dataset Highlights

Broad coverage of biology

Initial dataset of 26.7M images produced by a wide array of study types and hosted on 18 different platforms — 97% of contributing datasets carry Creative Commons licenses, 65% CC BY 4.0

[Figures: CHAMMI-75 sources; CHAMMI-75 modality diversity]

Modality diversity matters

Training on diverse microscopy types — brightfield, fluorescence, confocal, cryo-EM, and more — proves a critical driver of model performance, more important than data volume alone

Highest data quality for machine learning research

Systematic sampling of metadata fields to curate the data; annotation of 1.8B single cells to maximize meaningful image information; normalization of all images to 8-bit PNG with lossless compression for ease of use
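
The normalization step above can be sketched in a few lines. This is a minimal illustration, not the paper's exact recipe: the percentile bounds are illustrative assumptions, and the final PNG write (e.g., via Pillow) is omitted.

```python
import numpy as np

def to_uint8(channel: np.ndarray, low_pct: float = 0.1, high_pct: float = 99.9) -> np.ndarray:
    """Rescale one raw microscopy channel (often 12- or 16-bit) to 8-bit.

    Clips to illustrative percentile bounds to suppress hot pixels, then
    linearly maps the remaining intensity range onto 0..255.
    """
    lo, hi = np.percentile(channel, [low_pct, high_pct])
    if hi <= lo:  # constant channel: avoid divide-by-zero
        return np.zeros(channel.shape, dtype=np.uint8)
    scaled = (channel.astype(np.float32) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).round().astype(np.uint8)

# Example on a synthetic 16-bit channel
rng = np.random.default_rng(0)
raw = rng.integers(0, 65536, size=(64, 64), dtype=np.uint16)
img8 = to_uint8(raw)
```

Because PNG compression is lossless, this 8-bit conversion is the only lossy step, and the percentile clipping bounds determine how much dynamic range survives.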

[Figures: CHAMMI-75 performance; CHAMMI-75 metadata]

Comprehensive metadata

22 metadata fields organized into 6 groups for comprehensive bioimaging context — including 223 cell lines, mostly human (biology), and 14 imaging modalities (microscopy)
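
To make the grouped-field idea concrete, here is a hypothetical sketch of one image's metadata record. The "biology" and "microscopy" group names appear on this page; every other group and field name below is an illustrative placeholder, not the dataset's actual schema.

```python
# Hypothetical metadata record: fields nested under their group.
# The real dataset has 22 fields in 6 groups; only 3 groups are sketched here.
record = {
    "biology": {"cell_line": "U2OS", "organism": "human"},
    "microscopy": {"modality": "fluorescence", "channels": ["DNA", "actin", "tubulin"]},
    "source": {"study_id": "example-study", "license": "CC BY 4.0"},
}

# Grouping keeps related context together while staying flat enough to filter on,
# e.g. selecting all fluorescence images of human cell lines:
is_match = (
    record["biology"]["organism"] == "human"
    and record["microscopy"]["modality"] == "fluorescence"
)
```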

Building and Benchmarking CHAMMI-75

The emergence of multi-channel AI vision models heralds a new future for deep learning-driven life sciences. But existing initiatives have been constrained by scale. Technical differences and inconsistent metadata across the millions of publicly available multi-channel microscopy images make collecting and curating a single resource a formidable undertaking.

We set out to create a representative visual sample of cell biology as seen under the microscope over three phases — data acquisition, curation, and annotation — expanding on our earlier effort to test channel-adaptive models against fixed-channel baselines. The paradigm under which most AI vision models make sense of color requires a fixed number of channels, akin to the three channels of the RGB colorspace. CHAMMI — channel-adaptive models in microscopy imaging — instead treats the multiple, varied channels inherent to microscopy images, each encoding a different meaningful signal, as a virtue to be leveraged.
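
One common way to break the fixed-channel constraint is to apply a single shared per-channel projection and then pool over channels, so the same weights accept any number of input channels. The sketch below illustrates that idea with NumPy; it is a minimal toy, not the actual MorphEm architecture, and the patch size and embedding dimension are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One patch-embedding matrix shared across ALL channels. A fixed-channel model
# would instead bake the channel count into its first layer's weight shape.
PATCH, DIM = 16, 64
w_patch = rng.normal(scale=0.02, size=(PATCH * PATCH, DIM))

def embed(image: np.ndarray) -> np.ndarray:
    """image: (C, H, W) with arbitrary C; returns (num_patches, DIM) tokens."""
    c, h, w = image.shape
    tokens = []
    for i in range(0, h, PATCH):
        for j in range(0, w, PATCH):
            patch = image[:, i:i + PATCH, j:j + PATCH]    # (C, PATCH, PATCH)
            per_channel = patch.reshape(c, -1) @ w_patch  # (C, DIM), shared weights
            tokens.append(per_channel.mean(axis=0))       # pool over channels
    return np.stack(tokens)

# The same weights handle a 5-channel fluorescence image and a 3-channel one.
five = embed(rng.normal(size=(5, 64, 64)))
three = embed(rng.normal(size=(3, 64, 64)))
```

Mean-pooling is the simplest channel-order-invariant choice; real channel-adaptive models typically add channel-identity embeddings or attention so the model can still tell which signal came from which stain.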

Ultimately, breaking the fixed-channel bottleneck opens AI vision models to a wealth of technical and biological diversity that drives remarkably robust model performance. We systematically benchmarked our model, MorphEm, against models pre-trained on other state-of-the-art datasets across a variety of real-world tasks, and it consistently outperformed them.

Start Building with CHAMMI-75

Access the complete dataset, pre-trained models, and training code to advance your research in cellular morphology.

Read the Paper
Download Dataset
View on GitHub
Download Model

Our Team

The CHAMMI-75 team spans eight institutions and is led by Juan Caicedo at the Morgridge Institute for Research and University of Wisconsin–Madison, where the Center for High Throughput Computing delivers expertise and compute capacity at the requisite scale.

Vidit Agrawal | John Peters | Tyler N. Thompson | Arshad Kazi | Aditya Pillai | Juan Caicedo
Morgridge Institute for Research and University of Wisconsin–Madison

Chau Pham | Bryan A. Plummer
Boston University

Mohammad Vali Sanian | Lassi Paavolainen
Institute for Molecular Medicine Finland and University of Helsinki

Nikita Moshkov
Helmholtz Munich

Jack Freeman | Ron Stewart
Morgridge Institute for Research

Byunguk Kang | Samouil L. Farhi | Ernest Fraenkel
Broad Institute of MIT and Harvard and Massachusetts Institute of Technology

CHAMMI-75 is made possible thanks to generous support from

AWS
NSF
Meta