Frank F. Xu, Lei Ji, Botian Shi, Junyi Du, Graham Neubig...
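TL;DR: This paper proposes a benchmark for extracting structured procedural knowledge from cooking videos and studies the performance of existing models.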
Abstract
Instructional videos are often used to learn about procedures. Video
captioning is one way of automatically collecting such knowledge. However, it
provides only an indirect, overall evaluation of multimodal models with no
finer-grained quantitative measure of what they have le