SpatialDINO
A 3D self-supervised vision transformer for label-free segmentation and tracking of subcellular dynamics in lattice light-sheet microscopy (LLSM). SpatialDINO adapts DINO-style student/teacher self-distillation natively to 3D: student and teacher 3D ViTs are trained on volumetric LLSM crops with 3D iBOT block masking. Pre-training covered 2.4 TB / 180k volumes across 24 NVIDIA A100s using PyTorch DDP with bf16 mixed precision, NVLink intra-node, and InfiniBand inter-node. On downstream subcellular structure prediction it outperformed a prior approach co-led by Nobel laureate Eric Betzig. Released as a first-author bioRxiv preprint; the full engineering log lives at /knowledge/ai/spatialdino-lessons. The work also produced a rendezvous-backend fix to PyTorch (PR #144779) that unblocked multi-node training over InfiniBand.
Highlights
- Native 3D student/teacher ViT with 3D iBOT block masking — no 2D-stack hacks
- k-means content-aware 3D cropping and a 3D adaptation of SINDER for singular-defect repair
- No-positional-encoding 3D ViTs (NoPE) — let attention learn structure from scratch
- Streaming encoder with token-store + online softmax for full-volume inference at million-token sequence lengths
- Pre-trained on 2.4 TB / 180k volumes across 24 A100s with DDP + bf16
- Beat the prior SOTA (Nobel-laureate-led) on downstream subcellular structure prediction
- Surfaced a PyTorch Rendezvous backend bug, filed and contributed PR #144779
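The 3D iBOT block masking in the first highlight generalizes iBOT's 2D block-wise masking to a volumetric patch grid. A minimal sketch (NumPy, with assumed grid size, mask ratio, and block extents — the actual hyperparameters are not stated here):

```python
import numpy as np

def block_mask_3d(grid=(8, 8, 8), mask_ratio=0.4, max_block=4, rng=None):
    """Sample axis-aligned 3D blocks over the patch grid until roughly
    `mask_ratio` of patch tokens are masked (iBOT-style block masking,
    extended from 2D to three dimensions)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(grid, dtype=bool)
    target = int(mask_ratio * mask.size)
    while mask.sum() < target:
        # Random block extents along depth / height / width.
        dz, dy, dx = rng.integers(1, max_block + 1, size=3)
        z = rng.integers(0, grid[0] - dz + 1)
        y = rng.integers(0, grid[1] - dy + 1)
        x = rng.integers(0, grid[2] - dx + 1)
        mask[z:z + dz, y:y + dy, x:x + dx] = True
    return mask
```

Masking contiguous 3D blocks rather than independent tokens forces the student to reconstruct from genuinely distant context, which is the point of the block-wise variant.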
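Content-aware cropping (second highlight) biases training crops toward cell content instead of empty background. A hypothetical minimal version, assuming k-means with k=2 on raw voxel intensity as the foreground/background signal (the project's actual feature choice is not stated here):

```python
import numpy as np

def content_aware_crops(volume, crop=16, n_crops=4, iters=10, rng=None):
    """Pick crop centers from the brighter of two k-means intensity
    clusters, so random crops land on content rather than background
    (illustrative sketch, not the project's exact procedure)."""
    rng = np.random.default_rng() if rng is None else rng
    vals = volume.reshape(-1).astype(float)
    # 1D k-means with k=2 on voxel intensity, initialized at the extremes.
    c = np.array([vals.min(), vals.max()])
    for _ in range(iters):
        assign = np.abs(vals[:, None] - c[None, :]).argmin(1)
        for j in range(2):
            if (assign == j).any():
                c[j] = vals[assign == j].mean()
    # Foreground = the brighter cluster; sample crop centers from it.
    fg = np.flatnonzero(assign == c.argmax())
    centers = rng.choice(fg, size=n_crops, replace=False)
    half = crop // 2
    coords = np.stack(np.unravel_index(centers, volume.shape), axis=1)
    coords = np.clip(coords, half, np.array(volume.shape) - half)
    return [tuple(slice(z - half, z + half) for z in cz) for cz in coords]
```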
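The streaming-inference highlight rests on the online-softmax recurrence: attention over a million-token sequence can be computed one K/V chunk at a time by carrying a running max, denominator, and weighted value sum, so no full attention row is ever materialized. A single-query NumPy sketch of that recurrence (the token-store plumbing is omitted):

```python
import numpy as np

def streaming_attention(q, kv_chunks):
    """Attend one query over K/V chunks streamed from a token store,
    using the online-softmax recurrence so the full sequence never has
    to be in memory at once."""
    m = -np.inf              # running max of attention logits
    denom = 0.0              # running softmax denominator
    acc = np.zeros_like(q)   # running weighted sum of values
    for k, v in kv_chunks:
        logits = k @ q / np.sqrt(q.size)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)     # rescale old partial sums
        w = np.exp(logits - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ v
        m = m_new
    return acc / denom
```

The chunked result is exactly equal to softmax attention over the concatenated sequence; the same trick underlies FlashAttention-style kernels.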
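The 24-A100 DDP run and the rendezvous fix both live at the launch layer. A hypothetical `torchrun` invocation for a setup like this (hostnames, port, script name, and the `--precision` flag are placeholders; the c10d rendezvous backend and NCCL interface pinning are the parts relevant to InfiniBand):

```shell
# Assumed topology: 3 nodes x 8 GPUs = 24 A100s; "node0" and "ib0" are placeholders.
export NCCL_SOCKET_IFNAME=ib0      # pin NCCL to the InfiniBand interface (assumption)
torchrun \
  --nnodes=3 --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node0:29500 \
  --rdzv_id=spatialdino \
  train.py --precision bf16        # hypothetical training script and flag
```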
Tech
The canonical source for this project is on GitHub.
View on GitHub