We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" provided by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives