Affordance denotes the potential interactions inherent in objects. Perceiving affordances enables intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, using exocentric images instead. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior work, INTRA recasts this problem as representation learning that identifies unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationships for contrastive learning and enhancing robustness with our text synonym augmentation. Our method outperforms prior work on diverse datasets such as AGD20K, IIT-AFF, CAD, and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability to synthesized images and illustrations and is capable of performing affordance grounding for novel interactions and objects.
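To make the contrastive-learning idea concrete, the following is a minimal sketch of a generic supervised contrastive loss, in which embeddings of exocentric images that share an interaction label are pulled together and the rest are pushed apart. It is not INTRA's actual objective, which the abstract does not specify. The function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not INTRA's exact loss).

    features: (N, D) array of image embeddings (hypothetical encoder output).
    labels:   length-N sequence of interaction labels, e.g. "hold" or "cut".
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]

    # L2-normalize so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Row-wise log-softmax over all other samples (numerically stabilized).
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denominator

    # Positives: other samples with the same interaction label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors that have at least one positive
    if not valid.any():
        return 0.0

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Toy embeddings: the first two and the last two point in similar directions.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(toy_features, ["hold", "hold", "cut", "cut"])
misaligned_loss = supervised_contrastive_loss(toy_features, ["hold", "cut", "hold", "cut"])
```

In this toy setup the loss is lower when the labels agree with the embedding geometry, which is the signal a contrastive objective uses to shape interaction-specific features.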

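This work addresses two problems in weakly supervised affordance grounding: the need for paired exocentric and egocentric image datasets, and the complexity of grounding diverse affordances on a single object. The proposed INTRA method identifies interaction features through contrastive learning on exocentric images only, eliminating the need for paired datasets, and incorporates vision-language model embeddings to flexibly generate text-conditioned affordance maps. Experimental results show that the method performs strongly on multiple datasets and has remarkable domain scalability in affordance grounding for novel interactions and objects.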

INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding