DINOSAUR: Bridging the Gap to Real-World Object-Centric Learning

Maximilian Seitzer1     Max Horn2     Andrii Zadaianchuk1     Dominik Zietlow2     Tianjun Xiao2     Carl-Johann Simon-Gabriel2     Tong He2     Zheng Zhang2     Bernhard Schölkopf2     Thomas Brox2     Francesco Locatello2
1Max Planck Institute for Intelligent Systems, Tübingen
2Amazon AWS, Tübingen
Published at ICLR 2023

Summary

DINOSAUR is a model for unsupervised object-centric representation learning. This means that it learns a set representation that binds different entities in the input to separate vectors, purely from images, without any supervision. DINOSAUR achieves this by building on the DINO method and other self-supervised learning techniques. While previous object-centric learning methods were mostly constrained to simple, synthetic datasets, DINOSAUR enables, for the first time, learning object-centric representations on complex, real-world images.

Model

[Figure: The DINOSAUR model architecture]

The DINOSAUR model. The input image is processed into a set of patch features by a frozen ViT pre-trained with the self-supervised DINO method; the image is encoded either by that same DINO ViT or by a separate encoder network. Slot Attention groups the encoded features into a set of slots, and the model is trained by reconstructing the DINO features from the slots.
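
Below is a minimal sketch of how these pieces fit together in code. It is not the official implementation: the `SlotAttention` and `MLPDecoder` classes are simplified stand-ins for the components described above, and the timm model tag `vit_base_patch16_224.dino` and the `num_prefix_tokens` attribute are assumptions about the timm version in use.

```python
# Minimal DINOSAUR training-step sketch (illustrative, not the official code).
import torch
import torch.nn as nn
import timm


class SlotAttention(nn.Module):
    """Simplified Slot Attention (Locatello et al., 2020) over patch features."""

    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: (B, N, D)
        b, n, d = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(
            b, self.num_slots, d, device=feats.device
        )
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete for patches
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over patches
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots


class MLPDecoder(nn.Module):
    """Spatial-broadcast MLP decoder: each slot predicts per-patch features and an alpha mask."""

    def __init__(self, dim, num_patches, hidden=1024):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, 1, num_patches, dim) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim + 1),  # per-patch feature + alpha logit
        )

    def forward(self, slots):  # slots: (B, K, D)
        x = slots.unsqueeze(2) + self.pos        # broadcast each slot over patch positions
        out = self.mlp(x)
        feats, alpha = out[..., :-1], out[..., -1:].softmax(dim=1)
        recon = (feats * alpha).sum(dim=1)       # (B, N, D) mixture over slots
        return recon, alpha.squeeze(-1)          # alpha gives per-slot object masks


# Frozen DINO ViT provides both the encoded features and the reconstruction targets.
vit = timm.create_model("vit_base_patch16_224.dino", pretrained=True)
vit.eval().requires_grad_(False)

grouping = SlotAttention(num_slots=7, dim=768)
decoder = MLPDecoder(dim=768, num_patches=196)

images = torch.randn(2, 3, 224, 224)  # stand-in batch
with torch.no_grad():
    targets = vit.forward_features(images)[:, vit.num_prefix_tokens:]  # drop CLS token

slots = grouping(targets)                       # group patch features into slots
recon, masks = decoder(slots)                   # reconstruct DINO features from slots
loss = nn.functional.mse_loss(recon, targets)   # feature-reconstruction objective
```

The loss is a simple mean-squared error between decoded and target DINO features; the decoder's per-slot alpha masks are what yield the object segmentations evaluated in the tables below.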

Reference Results for Object Discovery

This section serves as a reference for the results that different versions of DINOSAUR can achieve on the object discovery task. Note that the numbers vary considerably with the encoder, the decoder, the number of slots, and other hyperparameters. In particular, the MLP decoder generally achieves higher FG-ARI, whereas the Transformer decoder generally achieves higher mBO. If you reimplement DINOSAUR, your model should be able to reach numbers similar to those in the tables. If you compare with DINOSAUR in your own work, make sure you either have a faithful reimplementation or directly use the numbers stated here. It is generally advisable to use the strongest version of a baseline in a given reference class.

We split the results into two parts: results from the original DINOSAUR paper, and results from follow-up work that uses the DINOSAUR architecture but potentially achieves stronger numbers. If you would like to add your DINOSAUR-based results to this table, feel free to send us an email. We only accept results that come with a reproducible code implementation.

DINOSAUR Paper

| Dataset | Encoder | Decoder | #Slots | FG-ARI | mBO | Reference |
|---|---|---|---|---|---|---|
| MOVi-C¹ | ViT-B/16, DINO | MLP | 11 | 66.0 | 35.0 | Table 3 |
| MOVi-C¹ | ViT-B/16, DINO | Transformer | 11 | 55.7 | 42.4 | Table 3 |
| MOVi-C¹ | ViT-S/8, DINO | MLP | 11 | 67.2 | 38.6 | Figure 3 |
| MOVi-C¹ | ViT-B/8, DINO | MLP | 11 | 68.6 | 39.1 | Figure 3 |
| MOVi-E¹ | ViT-S/8, DINO | MLP | 11 | 76.7 | 29.7 | Table 13 |
| MOVi-E¹ | ViT-S/8, DINO | MLP | 24 | 64.7 | 34.1 | Figure 3 |
| MOVi-E¹ | ViT-B/8, DINO | MLP | 11 | 79.3 | 32.7 | Table 13 |
| MOVi-E¹ | ViT-B/8, DINO | MLP | 24 | 65.1 | 35.5 | Figure 3 |
| COCO² | ViT-S/16, DINO | Transformer | 7 | 36.9 | 29.7 | Table 11 |
| COCO² | ViT-B/16, DINO | MLP | 7 | 40.5 | 27.7 | Table 12 |
| COCO² | ViT-B/16, MAE | MLP | 7 | 42.3 | 29.1 | Table 12 |
| COCO² | ViT-B/16, DINO | Transformer | 7 | 34.1 | 31.6 | Figure 5 |
| COCO² | ViT-S/8, DINO | Transformer | 7 | 34.3 | 32.3 | Table 11 |
| PASCAL 2012² | ViT-B/16, DINO | MLP | 6 | 24.6 | 39.5 | Table 3 |
| PASCAL 2012² | ViT-B/16, DINO | Transformer | 6 | 24.8 | 44.0 | Figure 5 |

Follow-up Work

| Dataset | Encoder | Decoder | #Slots | FG-ARI | mBO | Reference |
|---|---|---|---|---|---|---|
| COCO³ | ViT-B/14, DINOv2 | MLP | 7 | 45.6 | 29.6 | VideoSAUR code release |

¹ 128×128 mask resolution
² 320×320 mask resolution, central crops
³ 224×224 mask resolution, central crops
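
For reference, the sketch below shows how FG-ARI numbers like those above can be computed from ground-truth and predicted masks using scikit-learn. The function name `fg_ari` and the toy masks are illustrative assumptions; the exact protocol (resizing masks to the resolutions in the footnotes, cropping, and pixel exclusion rules) follows the paper rather than this snippet.

```python
# Sketch of FG-ARI computation (illustrative, not the official evaluation code).
import numpy as np
from sklearn.metrics import adjusted_rand_score


def fg_ari(true_masks: np.ndarray, pred_masks: np.ndarray) -> float:
    """FG-ARI between two integer-labeled mask maps of shape (H, W).

    `true_masks` uses 0 for background and 1..K for ground-truth objects;
    `pred_masks` assigns every pixel to one of the predicted slots.
    Only foreground pixels (ground-truth label > 0) enter the score.
    """
    fg = true_masks.ravel() > 0
    return adjusted_rand_score(true_masks.ravel()[fg], pred_masks.ravel()[fg])


# Toy example: two ground-truth objects; the prediction splits object 1 in two.
gt = np.zeros((4, 4), dtype=int)
gt[:2, :2], gt[2:, 2:] = 1, 2
pred = np.zeros((4, 4), dtype=int)
pred[0, :2], pred[1, :2], pred[2:, 2:] = 3, 4, 5
print(f"FG-ARI: {fg_ari(gt, pred):.3f}")  # below 1.0 due to over-segmentation
```

mBO is computed differently: it is the mean IoU between each ground-truth mask and its best-overlapping predicted mask.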

Implementations

There are currently two officially sanctioned implementations:

Related Projects

  • VideoSAUR (NeurIPS 2023): object-centric representations for real-world videos using a DINOSAUR-style framework and a novel temporal similarity loss

BibTeX


@inproceedings{seitzer2023bridging,
  title={Bridging the Gap to Real-World Object-Centric Learning},
  author={Maximilian Seitzer and Max Horn and Andrii Zadaianchuk and Dominik Zietlow and Tianjun Xiao and Carl-Johann Simon-Gabriel and Tong He and Zheng Zhang and Bernhard Sch{\"o}lkopf and Thomas Brox and Francesco Locatello},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=b9tUk-f_aG}
}