unsupervised multi-object segmentation has shown impressive results on images
by utilizing powerful semantics learned from self-supervised pretraining. An
additional modality such as depth or motion is often used to facilitate the
segmentation in video sequences. However, the performan