Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of both the user and the environment, and must make timely, accurate decisions about when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.

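Based on natural interaction between a human user and a human instructor, we created a multimodal benchmark dataset called WTaG and proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We used several foundation models to study how far these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can achieve fair performance in some cases, but fast and reliable adaptation remains a significant challenge. Our benchmark dataset and baselines provide a starting point for future research.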

Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?