AI personal assistants deployed via robots or wearables require embodied
understanding to collaborate with humans effectively. However, current
Vision-Language Models (VLMs) primarily focus on third-person view videos,
neglecting the richness of egocentric perceptual experience. To address this
gap, we make three key contributions. First, we introduce the Egocentric
Video Understanding Dataset (EVUD) for training VLMs on video captioning and
question answering tasks specific to egocentric videos. Second, we present
AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging
benchmark for embodied video question answering. Our model achieves
state-of-the-art performance, outperforming open-source models, including strong
Socratic models that use GPT-4 as a planner, by 3.6%. Additionally, we outperform
Claude 3 and Gemini Pro Vision 1.0 and achieve results competitive with
Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning.
This research paves the way for building efficient VLMs that can be deployed in
robots or wearables, leveraging embodied video understanding to collaborate
seamlessly with humans in everyday tasks, contributing to the next generation
of Embodied AI.
