DINOSAUR: Bridging the Gap to Real-World Object-Centric Learning

Maximilian Seitzer1     Max Horn2     Andrii Zadaianchuk1     Dominik Zietlow2     Tianjun Xiao2     Carl-Johann Simon-Gabriel2     Tong He2     Zheng Zhang2     Bernhard Schölkopf2     Thomas Brox2     Francesco Locatello2
1Max Planck Institute for Intelligent Systems, Tübingen
2Amazon AWS, Tübingen
Published at ICLR 2023

Summary

DINOSAUR is a model for unsupervised object-centric representation learning. This means that it learns a set representation that binds different entities in the input to separate vectors, purely from images, without any supervision. DINOSAUR achieves this by building on the DINO method and other self-supervised learning techniques. While previous object-centric learning methods were mostly constrained to simple, synthetic datasets, DINOSAUR enables, for the first time, learning object-centric representations on complex, real-world images.

Model

[Figure: The DINOSAUR model architecture]

The DINOSAUR model. The input image is processed into a set of patch features by a frozen ViT pre-trained with the self-supervised DINO method; the image is encoded either by that same DINO ViT or by a separate encoder network. Slot Attention groups the encoded features into a set of slots, and the model is trained by reconstructing the DINO features from the slots.
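
Below is a minimal sketch of how these pieces fit together in code. It is not the official implementation: the `SlotAttention` and `MLPDecoder` classes are simplified stand-ins for the components described above, and the timm model tag `vit_base_patch16_224.dino` and the `num_prefix_tokens` attribute are assumptions about the timm version in use.

```python
# Minimal DINOSAUR training-step sketch (illustrative, not the official code).
import torch
import torch.nn as nn
import timm


class SlotAttention(nn.Module):
    """Simplified Slot Attention (Locatello et al., 2020) over patch features."""

    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: (B, N, D)
        b, n, d = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(
            b, self.num_slots, d, device=feats.device
        )
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete for patches
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over patches
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots


class MLPDecoder(nn.Module):
    """Spatial-broadcast MLP decoder: each slot predicts per-patch features and an alpha mask."""

    def __init__(self, dim, num_patches, hidden=1024):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, 1, num_patches, dim) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim + 1),  # per-patch feature + alpha logit
        )

    def forward(self, slots):  # slots: (B, K, D)
        x = slots.unsqueeze(2) + self.pos        # broadcast each slot over patch positions
        out = self.mlp(x)
        feats, alpha = out[..., :-1], out[..., -1:].softmax(dim=1)
        recon = (feats * alpha).sum(dim=1)       # (B, N, D) mixture over slots
        return recon, alpha.squeeze(-1)          # alpha gives per-slot object masks


# Frozen DINO ViT provides both the encoded features and the reconstruction targets.
vit = timm.create_model("vit_base_patch16_224.dino", pretrained=True)
vit.eval().requires_grad_(False)

grouping = SlotAttention(num_slots=7, dim=768)
decoder = MLPDecoder(dim=768, num_patches=196)

images = torch.randn(2, 3, 224, 224)  # stand-in batch
with torch.no_grad():
    targets = vit.forward_features(images)[:, vit.num_prefix_tokens:]  # drop CLS token

slots = grouping(targets)                       # group patch features into slots
recon, masks = decoder(slots)                   # reconstruct DINO features from slots
loss = nn.functional.mse_loss(recon, targets)   # feature-reconstruction objective
```

The loss is a simple mean-squared error between decoded and target DINO features; the decoder's per-slot alpha masks are what yield the object segmentations evaluated in the tables below.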

Reference Results for Object Discovery

This section serves as a reference for the results that different versions of DINOSAUR can achieve on the object discovery task. Note that the numbers vary considerably with the encoder, the decoder, the number of slots, and other hyperparameters. In particular, the MLP decoder generally achieves higher FG-ARI, whereas the Transformer decoder generally achieves higher mBO. If you reimplement DINOSAUR, your model should be able to reach numbers similar to those in the tables. If you compare with DINOSAUR in your own work, make sure you either have a faithful reimplementation or directly use the numbers stated here. It is generally advisable to use the strongest version of a baseline in a given reference class.

We split the results into two parts: results from the original DINOSAUR paper, and results from follow-up work that uses the DINOSAUR architecture but potentially achieves stronger numbers. If you would like to add your DINOSAUR-based results to this table, feel free to send us an email. We only accept results that come with a reproducible code implementation.

DINOSAUR Paper

| Dataset | Encoder | Decoder | #Slots | FG-ARI | mBO | Reference |
|---|---|---|---|---|---|---|
| MOVi-C¹ | ViT-B/16, DINO | MLP | 11 | 66.0 | 35.0 | Table 3 |
| MOVi-C¹ | ViT-B/16, DINO | Transformer | 11 | 55.7 | 42.4 | Table 3 |
| MOVi-C¹ | ViT-S/8, DINO | MLP | 11 | 67.2 | 38.6 | Figure 3 |
| MOVi-C¹ | ViT-B/8, DINO | MLP | 11 | 68.6 | 39.1 | Figure 3 |
| MOVi-E¹ | ViT-S/8, DINO | MLP | 11 | 76.7 | 29.7 | Table 13 |
| MOVi-E¹ | ViT-S/8, DINO | MLP | 24 | 64.7 | 34.1 | Figure 3 |
| MOVi-E¹ | ViT-B/8, DINO | MLP | 11 | 79.3 | 32.7 | Table 13 |
| MOVi-E¹ | ViT-B/8, DINO | MLP | 24 | 65.1 | 35.5 | Figure 3 |
| COCO² | ViT-S/16, DINO | Transformer | 7 | 36.9 | 29.7 | Table 11 |
| COCO² | ViT-B/16, DINO | MLP | 7 | 40.5 | 27.7 | Table 12 |
| COCO² | ViT-B/16, MAE | MLP | 7 | 42.3 | 29.1 | Table 12 |
| COCO² | ViT-B/16, DINO | Transformer | 7 | 34.1 | 31.6 | Figure 5 |
| COCO² | ViT-S/8, DINO | Transformer | 7 | 34.3 | 32.3 | Table 11 |
| PASCAL 2012² | ViT-B/16, DINO | MLP | 6 | 24.6 | 39.5 | Table 3 |
| PASCAL 2012² | ViT-B/16, DINO | Transformer | 6 | 24.8 | 44.0 | Figure 5 |

Follow-up Work

| Dataset | Encoder | Decoder | #Slots | FG-ARI | mBO | Reference |
|---|---|---|---|---|---|---|
| COCO³ | ViT-B/14, DINOv2 | MLP | 7 | 45.6 | 29.6 | VideoSAUR code release |

¹ 128×128 mask resolution
² 320×320 mask resolution, central crops
³ 224×224 mask resolution, central crops
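
For reference, the sketch below shows how FG-ARI numbers like those above can be computed from ground-truth and predicted masks using scikit-learn. The function name `fg_ari` and the toy masks are illustrative assumptions; the exact protocol (resizing masks to the resolutions in the footnotes, cropping, and pixel exclusion rules) follows the paper rather than this snippet.

```python
# Sketch of FG-ARI computation (illustrative, not the official evaluation code).
import numpy as np
from sklearn.metrics import adjusted_rand_score


def fg_ari(true_masks: np.ndarray, pred_masks: np.ndarray) -> float:
    """FG-ARI between two integer-labeled mask maps of shape (H, W).

    `true_masks` uses 0 for background and 1..K for ground-truth objects;
    `pred_masks` assigns every pixel to one of the predicted slots.
    Only foreground pixels (ground-truth label > 0) enter the score.
    """
    fg = true_masks.ravel() > 0
    return adjusted_rand_score(true_masks.ravel()[fg], pred_masks.ravel()[fg])


# Toy example: two ground-truth objects; the prediction splits object 1 in two.
gt = np.zeros((4, 4), dtype=int)
gt[:2, :2], gt[2:, 2:] = 1, 2
pred = np.zeros((4, 4), dtype=int)
pred[0, :2], pred[1, :2], pred[2:, 2:] = 3, 4, 5
print(f"FG-ARI: {fg_ari(gt, pred):.3f}")  # below 1.0 due to over-segmentation
```

mBO is computed differently: it is the mean IoU between each ground-truth mask and its best-overlapping predicted mask.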

Implementations

There are currently two officially sanctioned implementations:

Related Projects

  • VideoSAUR (NeurIPS 2023): object-centric representations for real-world videos using a DINOSAUR-style framework and a novel temporal similarity loss

BibTeX


@inproceedings{seitzer2023bridging,
  title={Bridging the Gap to Real-World Object-Centric Learning},
  author={Maximilian Seitzer and Max Horn and Andrii Zadaianchuk and Dominik Zietlow and Tianjun Xiao and Carl-Johann Simon-Gabriel and Tong He and Zheng Zhang and Bernhard Sch{\"o}lkopf and Thomas Brox and Francesco Locatello},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=b9tUk-f_aG}
}