SpatialDINO Lessons: 3D SSL on Cryo-ET
2025-09-01
What we learned building the first 3D self-supervised vision transformer for subcellular cryo-electron tomography.
SpatialDINO was the first 3D self-supervised vision transformer applied to cryo-electron tomograms. The model learned subcellular structural representations without labels and outperformed a Nobel laureate–led baseline built on hand-crafted geometric features.
What worked
- Volumetric ViT patches. Using 3D patches (rather than slice-then-stack) preserved the spatial coherence the network was meant to learn.
- DINOv2-style multi-crop self-distillation. Two student crops, four teacher crops, EMA teacher. The recipe transferred surprisingly well from natural images.
- Aggressive masking. 75% random masking forced the network to learn local structural context rather than memorize global appearance.
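The first and third points above can be sketched together: carve the volume into non-overlapping cubic patches (the 3D analogue of ViT patch embedding), then keep only a random 25% of the resulting tokens. This is a minimal NumPy illustration, not the actual SpatialDINO code; `patchify_3d` and `random_keep_indices` are hypothetical names, and a real model would project each patch with a learned linear layer (or `nn.Conv3d`).

```python
import numpy as np

def patchify_3d(vol, p):
    """Split a (D, H, W) volume into non-overlapping cubic patches of side p.

    Unlike slice-then-stack, each token covers a contiguous 3D block,
    so local spatial coherence is preserved within every patch.
    """
    d, h, w = (s // p for s in vol.shape)
    x = vol.reshape(d, p, h, p, w, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)      # gather the three patch axes together
    return x.reshape(d * h * w, p ** 3)    # (num_tokens, voxels_per_patch)

def random_keep_indices(num_tokens, mask_ratio, rng):
    """Indices of the tokens that survive random masking at the given ratio."""
    n_keep = int(num_tokens * (1 - mask_ratio))
    return np.sort(rng.permutation(num_tokens)[:n_keep])

# An 8^3 toy volume with 4^3 patches yields 8 tokens of 64 voxels each;
# at a 75% mask ratio only 2 tokens reach the encoder.
vol = np.arange(8 ** 3, dtype=np.float32).reshape(8, 8, 8)
tokens = patchify_3d(vol, 4)
keep = random_keep_indices(tokens.shape[0], 0.75, np.random.default_rng(0))
```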
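The self-distillation recipe in the second point reduces, per crop pair, to a cross-entropy between a centered, sharpened teacher distribution and the student distribution, with the teacher's weights tracking the student via an exponential moving average. The sketch below shows the generic DINO-style mechanics under assumed hyperparameters (temperatures 0.1/0.04, momentum 0.996); it is not SpatialDINO's implementation, and in a real training loop no gradients flow through the teacher branch.

```python
import numpy as np

def _softmax(x, temp):
    z = x / temp
    z = np.exp(z - z.max())                 # max-subtraction for numerical stability
    return z / z.sum()

def dino_crop_loss(student_logits, teacher_logits, center,
                   student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy for one (student crop, teacher crop) pair.

    Centering and a low teacher temperature sharpen the target
    distribution and guard against representation collapse.
    """
    p_teacher = _softmax(teacher_logits - center, teacher_temp)
    p_student = _softmax(student_logits, student_temp)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return {k: momentum * teacher_params[k] + (1 - momentum) * student_params[k]
            for k in teacher_params}
```

In the full recipe this loss is summed over all student/teacher crop pairings, and `ema_update` runs once per optimizer step.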
What didn't
- Full-volume passes. Tomograms are too large to fit in memory at native resolution. We sub-sampled instead, which cost us detail in fine structures.
- InfiniBand RDMA stalls. We caught a rendezvous backend bug in PyTorch (filed #144779), which cost us a week.
What's next
Diffusion-based denoising as a pretraining objective for cryo-ET is the obvious follow-up. The signal-to-noise floor of cryo-ET is far worse than that of natural images, so pretraining the network to denoise gives it a stronger structural prior than masked autoencoding alone.
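A denoising pretraining step of this kind can be sketched as: sample a noise level, corrupt the volume, and score the model's reconstruction of the clean signal. This is a simplified illustration of the idea under assumed choices (uniform noise schedule, MSE loss, a noise-level-conditioned `model` callable), not a committed design.

```python
import numpy as np

def denoising_pretrain_loss(vol, model, rng, sigma_max=1.0):
    """One diffusion-style denoising step: corrupt, reconstruct, score.

    `model(noisy, sigma)` is any callable conditioned on the noise level;
    a uniform sigma schedule and plain MSE stand in for a full
    diffusion formulation.
    """
    sigma = rng.uniform(0.0, sigma_max)
    noisy = vol + sigma * rng.standard_normal(vol.shape)
    pred = model(noisy, sigma)
    return float(np.mean((pred - vol) ** 2))
```

Training over many sampled noise levels forces the network to separate structure from the shot-noise-dominated background, which is exactly the prior that low-SNR cryo-ET data lacks.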