Grounding textual expressions on scene objects from first-person views is a highly demanding capability for developing agents that are aware of their surroundings and act on intuitive text instructions. Such a capability is necessary for glasses-type devices and autonomous robots to localize referred objects in the real world. In conventional referring expression comprehension tasks on images, however, datasets are mostly constructed from web-crawled data and do not reflect the diversity of real-world scenes and objects in which textual expressions must be grounded. Recently, Ego4D, a massive-scale egocentric video dataset, was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on the egocentric videos of Ego4D, we constructed RefEgo, a broad-coverage video-based referring expression comprehension dataset. Our dataset includes more than 12k video clips and 41 hours of video annotated for video-based referring expression comprehension. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise referred object tracking even in difficult conditions: the referred object goes out of frame in the middle of the video, or multiple similar objects appear in the video.

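Grounding textual expressions on scene objects from a first-person view is a truly challenging capability for developing agents that are aware of their environment and act on intuitive text instructions. Based on the first-person videos of Ego4D, this paper constructs RefEgo, a broad video-based referring expression comprehension dataset that includes more than 12k video clips and 41 hours of video-based referring expression comprehension annotations. By combining state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, we achieve tracking of objects in video even under difficult conditions: the referred object goes out of view in the middle of the video, or multiple similar objects appear in the video.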

RefEgo: A Referring Expression Comprehension Dataset from First-Person Egocentric Perception