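TL;DR: Proposes a video retrieval framework that supports multi-turn dialogue interaction, comprising an AI agent, a multimodal question-answer generator, and an information-guided supervisor. Experiments show that it significantly outperforms traditional non-interactive video retrieval systems.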
Abstract
The majority of traditional text-to-video retrieval systems operate in static
environments, i.e., there is no interaction between the user and the agent
beyond the initial textual query provided by the user. This can be sub-optimal
if the initial query has ambiguities, which would lead