Accurate affordance detection and segmentation with pixel-level precision is an important component of many complex interaction-based systems, such as robots and assistive devices. We present a new approach to affordance perception that enables accurate multi-label segmentation. Our approach automatically extracts grounded affordances from first-person videos of interactions, using a 3D map of the environment to localize each affordance with pixel-level precision. We use this method to build EPIC-Aff, the largest and most complete affordance dataset based on the EPIC-Kitchen dataset, which provides interaction-grounded, multi-label, metric and spatial affordance annotations. We then propose a new approach to affordance segmentation based on multi-label detection, which allows multiple affordances to co-exist in the same space, for example when they are associated with the same object. We present several multi-label detection strategies using several segmentation architectures. The experimental results highlight the importance of multi-label detection. Finally, we show how our metric representation can be exploited to build a map of interaction hotspots in spatial action-centric zones, and we use that representation to perform task-oriented navigation.
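To illustrate what multi-label detection means at the pixel level, the following minimal sketch contrasts independent per-class thresholding with a conventional single-label argmax. It is not the authors' implementation: the affordance names, the toy logits, the 0.5 threshold and the helper functions `multilabel_masks` and `singlelabel_map` are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_masks(logits, threshold=0.5):
    """Multi-label decision: each class is thresholded independently,
    so a pixel may carry zero, one, or several affordance labels.
    logits is a nested list indexed as [class][row][column]."""
    return [[[sigmoid(v) >= threshold for v in row] for row in cls]
            for cls in logits]

def singlelabel_map(logits):
    """Single-label baseline: argmax keeps exactly one class per pixel."""
    n_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [[max(range(n_classes), key=lambda c: logits[c][y][x])
             for x in range(width)]
            for y in range(height)]

# Hypothetical affordance classes and a toy 1x2 "image" of logits.
AFFORDANCES = ["cut", "take", "wash"]
logits = [
    [[2.0, -3.0]],   # cut
    [[1.5, -2.0]],   # take
    [[-4.0, 3.0]],   # wash
]

masks = multilabel_masks(logits)
# Labels active at pixel (0, 0) under the multi-label decision.
labels_at_pixel0 = [name for name, mask in zip(AFFORDANCES, masks)
                    if mask[0][0]]
# Single class index per pixel under the argmax decision.
argmax_labels = singlelabel_map(logits)
```

In this toy case the first pixel keeps both "cut" and "take" under the multi-label decision, whereas the argmax baseline retains only "cut", discarding an affordance that shares the same location.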


Multi-Label Affordance Mapping from Egocentric Vision