[R] Vision Transformers with Self-Distilled Registers, NeurIPS 2025

Sharing some of our work, published at NeurIPS 2025 as a Spotlight.

Weights and code are public (see ArXiv).

TL;DR: Vision Transformers typically have artifacts in their dense features. While the exact reason is unknown, there is consensus that adding so-called “register” tokens mitigates this issue. These tokens participate in self-attention, but are not used for the output.
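For readers unfamiliar with registers, here is a minimal sketch (not our actual code, and all names are made up) of how register tokens are typically wired into a ViT: they are appended to the token sequence before the transformer blocks and discarded at the output.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Hypothetical wrapper: append learnable register tokens to a ViT."""

    def __init__(self, vit_blocks: nn.Module, embed_dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.blocks = vit_blocks  # pre-trained transformer blocks, (B, T, D) -> (B, T, D)
        self.registers = nn.Parameter(torch.randn(1, num_registers, embed_dim) * 0.02)
        self.num_registers = num_registers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, D) = [CLS] + patch tokens after patch embedding
        reg = self.registers.expand(tokens.shape[0], -1, -1)
        x = torch.cat([tokens, reg], dim=1)   # registers take part in self-attention
        x = self.blocks(x)
        return x[:, : -self.num_registers]    # registers are dropped from the output
```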

When registers were introduced alongside the DINOv2 models at ICLR 2024, they required vision transformers to be trained from scratch, which most people obviously cannot afford.

We show that you can actually get the benefits of registers pretty cheaply from existing pre-trained models, without ANY labeled images. The key is the semantic invariance of images under shifts and left-right flips (for most natural images; obviously don’t flip images that contain text). We simply augment the image randomly multiple times, pad the borders with white, un-shift/un-flip the resulting dense features, and average them over augmentations to use as a distillation target (see the sketch below).
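A rough sketch of how such a target could be built, assuming a `teacher` that returns a dense feature map of shape (B, h, w, C); shifts are restricted to whole patches here to keep the un-shift exact, and the wrapped edge patches (which come from the white padding) would be masked in a careful implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoised_target(teacher, img, num_aug=32, max_shift_patches=2, patch=16):
    """Average teacher dense features over random shift / horizontal-flip augmentations."""
    B, _, H, W = img.shape
    s = max_shift_patches * patch
    acc = None
    for _ in range(num_aug):
        dx = patch * int(torch.randint(-max_shift_patches, max_shift_patches + 1, (1,)))
        dy = patch * int(torch.randint(-max_shift_patches, max_shift_patches + 1, (1,)))
        flip = bool(torch.rand(1) < 0.5)

        x = img.flip(-1) if flip else img
        # shift by (dy, dx): pad with white (inputs assumed in [0, 1]), then crop an offset window
        x = F.pad(x, (s, s, s, s), value=1.0)
        x = x[..., s + dy : s + dy + H, s + dx : s + dx + W]

        f = teacher(x)                                          # assumed (B, h, w, C) dense features
        # undo the augmentation in feature space: un-shift on the patch grid, then un-flip
        f = torch.roll(f, shifts=(dy // patch, dx // patch), dims=(1, 2))
        if flip:
            f = f.flip(2)
        acc = f if acc is None else acc + f
    return acc / num_aug                                        # averaged features = distillation target
```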

Surprisingly, this extremely simple approach (Post Hoc Registers, PH-Reg) improves dense features for segmentation and depth across all datasets, compared to both the original student and the non-augmented teacher.
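For completeness, a hypothetical sketch of the distillation step itself, assuming the student is the pre-trained ViT with newly added register tokens and a plain MSE objective on the dense features (the exact losses and schedules are in the paper):

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, img, optimizer):
    """One update: fit student dense features to the augmentation-averaged teacher target."""
    target = denoised_target(teacher, img)   # from the sketch above
    pred = student(img)                      # assumed (B, h, w, C) dense features
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```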

Our results are better than traditional attention modifications (MaskCLIP, ECCV 2022; SCLIP, ECCV 2024; ClearCLIP, ECCV 2024; NACLIP, WACV 2025), and our method is much cheaper than Denoising Vision Transformers since we don’t need neural fields. It also adds minimal parameters on top of the original model.

submitted by /u/44seconds