While current visual captioning models have achieved impressive performance,
they often assume that the image is well-captured and provides a complete view
of the scene. In real-world scenarios, however, a single image may not offer a
good viewpoint, hindering fine-grained scene understanding. To overcome this
limitation, we propose a novel task called Embodied Captioning, which equips
visual captioning models with navigation capabilities, enabling them to
actively explore the scene and reduce visual ambiguity from suboptimal
viewpoints. Specifically, starting at a random viewpoint, an agent must
navigate the environment to gather information from different viewpoints and
generate a comprehensive paragraph describing all objects in the scene. To
support this task, we build the ET-Cap dataset with the Kubric simulator,
consisting of 10K 3D scenes with cluttered objects and three annotated
paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT),
which comprises a navigator and a captioner, to tackle this task. The
navigator predicts which actions to take in the environment, while the
captioner generates a paragraph description based on the whole navigation
trajectory. Extensive experiments demonstrate that our model outperforms other
carefully designed baselines. Our dataset, code, and models are available at
this https URL

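Current visual captioning models assume that the image is a well-captured, complete view of the scene; in real-world scenarios, however, a single image may not offer a good viewpoint, which limits fine-grained scene understanding. To overcome this limitation, we propose a new task called Embodied Captioning, which combines visual captioning models with navigation capabilities so that they can actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. To support this task, we build the ET-Cap dataset, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at this link.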
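
To make the cascade described above concrete, the following is a minimal toy
sketch of the control flow only: a navigator chooses actions until it stops,
and a captioner then describes the scene from the whole trajectory. It is not
the CaBOT implementation. The grid world, the action set, and every name in it
(ToyScene, toy_navigator, toy_captioner, explore_and_caption) are hypothetical
stand-ins for illustration, and the hand-written rules replace what would be
learned models operating on rendered images.

```python
from typing import Callable, Dict, List, Tuple

Position = Tuple[int, int]
Step = Tuple[Position, List[str]]  # (viewpoint, objects visible from it)

# Hypothetical discrete action set; the real task's action space may differ.
ACTIONS: Dict[str, Position] = {
    "left": (-1, 0), "right": (1, 0), "forward": (0, 1), "backward": (0, -1),
}


class ToyScene:
    """Toy stand-in for a 3D scene: maps each viewpoint to its visible objects."""

    def __init__(self, visible: Dict[Position, List[str]]):
        self.visible = visible

    def observe(self, pos: Position) -> List[str]:
        return list(self.visible.get(pos, []))


def toy_navigator(trajectory: List[Step]) -> str:
    """Rule-based placeholder: keep moving right while new objects appear."""
    if len(trajectory) == 1:
        return "right"
    seen_before = {obj for _, obs in trajectory[:-1] for obj in obs}
    newly_seen = set(trajectory[-1][1]) - seen_before
    return "right" if newly_seen else "stop"


def toy_captioner(trajectory: List[Step]) -> str:
    """Placeholder captioner: lists every object seen along the trajectory."""
    objects: List[str] = []
    for _, obs in trajectory:
        for obj in obs:
            if obj not in objects:
                objects.append(obj)
    if not objects:
        return "The scene appears to be empty."
    return "The scene contains " + ", ".join(objects) + "."


def explore_and_caption(
    scene: ToyScene,
    start: Position,
    navigator: Callable[[List[Step]], str],
    captioner: Callable[[List[Step]], str],
    max_steps: int = 10,
) -> Tuple[List[Step], str]:
    """Cascade: navigate first, then caption from the whole trajectory."""
    pos = start
    trajectory: List[Step] = [(pos, scene.observe(pos))]
    for _ in range(max_steps):
        action = navigator(trajectory)
        if action == "stop":
            break
        dx, dy = ACTIONS[action]
        pos = (pos[0] + dx, pos[1] + dy)
        trajectory.append((pos, scene.observe(pos)))
    return trajectory, captioner(trajectory)


scene = ToyScene({
    (0, 0): ["mug"],
    (1, 0): ["mug", "book"],
    (2, 0): ["book", "lamp"],
    (3, 0): ["lamp"],
})
trajectory, caption = explore_and_caption(scene, (0, 0), toy_navigator, toy_captioner)
print(caption)
```

In this toy setting the starting viewpoint shows only one object, and the
caption becomes complete only after the agent has visited further viewpoints,
which mirrors the motivation for the task. Captioning once, after navigation
ends, reflects the cascade design stated in the abstract; a learned navigator
and captioner would take the place of the two placeholder functions.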