Learning compositional representations is a key aspect of object-centric learning, as it enables flexible systematic generalization and supports complex visual reasoning. However, most existing approaches rely on an auto-encoding objective, while compositionality is imposed only implicitly by architectural or algorithmic biases in the encoder. This misalignment between the auto-encoding objective and the goal of learning compositionality often results in a failure to capture meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon an existing object-centric learning framework (e.g., slot attention), our method adds the constraint that an arbitrary mixture of object representations from two images should be valid, enforced by maximizing the likelihood of the composite data. We demonstrate that incorporating our objective into the existing framework consistently improves object-centric learning and enhances robustness to architectural choices.
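To make the mixing idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes both images are encoded into the same number of slots and that each composite slot is copied from one of the two images with equal probability; the abstract does not specify the mixing rule. The names `mix_slots`, `total_loss`, and `weight` are hypothetical, and the likelihood model that scores the composite is left out and represented only by a scalar `composite_nll`.

```python
import random


def mix_slots(slots_a, slots_b, rng=None):
    """Return a composite slot set plus the source image of each slot.

    Assumption: both images have the same number of slots, and each
    composite slot is copied from image "a" or image "b" with equal
    probability. The source text does not specify the mixing rule.
    """
    if len(slots_a) != len(slots_b):
        raise ValueError("both images must have the same number of slots")
    rng = rng or random.Random()
    mixed, sources = [], []
    for slot_a, slot_b in zip(slots_a, slots_b):
        take_a = rng.random() < 0.5
        mixed.append(slot_a if take_a else slot_b)
        sources.append("a" if take_a else "b")
    return mixed, sources


def total_loss(recon_loss, composite_nll, weight=1.0):
    """Auto-encoding loss plus a weighted negative log-likelihood term.

    `composite_nll` stands in for the negative log-likelihood of the
    composite data; `weight` is a hypothetical hyperparameter.
    """
    return recon_loss + weight * composite_nll


# Toy usage: three 2-dimensional slots per image.
slots_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
slots_b = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
mixed, sources = mix_slots(slots_a, slots_b, rng=random.Random(0))
```

In this sketch the composite keeps one slot per position, so the decoder would receive the same number of slots as in ordinary auto-encoding; other mixing schemes (for example, sampling a subset from the union of both slot sets) would fit the abstract's description equally well.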

Learning to Compose: Improving Object-Centric Learning by Injecting Compositionality