In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions describe the visual and audio content of videos in a large corpus in a textual format that is easy to reason about and to incorporate into large language model (LLM) prompts. They also typically require less multimedia content to be inserted into the multimodal LLM context window, whereas typical configurations can aggressively fill up the context window by sampling video frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundational model / captioner for particular visual details or by fine-tuning. In hopes of helping to advance progress in this area, we curate a dataset and describe automatic evaluation procedures on common RAG tasks.

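We propose using "aligned visual captions" as a mechanism for integrating the information in videos into retrieval augmented generation based chat assistant systems. These captions describe the visual and audio content of videos in textual form, are easy to understand and to add to large language model prompts, and require less multimedia content to be inserted into the multimodal language model's context window. To promote progress in this area, we also build a dataset for common retrieval augmented generation tasks and describe automatic evaluation procedures.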

Video Enriched Retrieval Augmented Generation Using Aligned Video Captions