We present MM-Narrator, a novel system that leverages GPT-4 with multimodal in-context learning to generate audio descriptions (AD). Unlike previous methods, which primarily focused on downstream fine-tuning with short video clips, MM-Narrator generates precise audio descriptions for long videos, even those spanning hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which uses both short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent and character-centric audio descriptions. While keeping MM-Narrator training-free, we further propose a complexity-based demonstration selection strategy that substantially enhances its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Powered by GPT-4, this evaluator reasons about and scores AD generation performance along multiple extensible dimensions.
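The register-and-recall idea can be illustrated with a minimal sketch. The abstract does not specify how the memories are stored or queried, so everything here is an assumption made for illustration: the `NarratorMemory` class, the use of plain feature vectors with cosine similarity for the long-term visual memory, and the fixed-size window of recent descriptions for the short-term textual context are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class NarratorMemory:
    """Hypothetical container, not the authors' implementation."""
    # Short-term textual context: only the most recent descriptions are kept.
    recent_ads: deque = field(default_factory=lambda: deque(maxlen=3))
    # Long-term visual memory: (feature vector, character name) pairs.
    visual_bank: list = field(default_factory=list)

    def register(self, feature, character, ad_text):
        """Store a visual feature with its character label and log the new description."""
        self.visual_bank.append((feature, character))
        self.recent_ads.append(ad_text)

    def recall(self, query, top_k=1):
        """Return the character names whose stored features best match the query."""
        ranked = sorted(self.visual_bank,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        return [name for _, name in ranked[:top_k]]


memory = NarratorMemory()
memory.register([1.0, 0.0], "Alice", "Alice enters the room.")
memory.register([0.0, 1.0], "Bob", "Bob looks up from his desk.")
recalled = memory.recall([0.9, 0.1])  # query feature closest to Alice's entry
```

In an autoregressive loop, the recalled names and the contents of `recent_ads` would be placed in the prompt for the next clip, so that each new description stays consistent with earlier ones.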

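This work proposes MM-Narrator, a new system that uses GPT-4 with multimodal in-context learning to generate audio descriptions. Through the proposed memory-augmented generation process, the system generates accurate audio descriptions autoregressively, even for long videos lasting several hours or more. MM-Narrator also adopts a complexity-based demonstration selection strategy, which substantially enhances its multi-step reasoning capability through few-shot multimodal in-context learning (MM-ICL). Experiments on the MAD-eval dataset show that MM-Narrator outperforms existing fine-tuning-based and LLM-based approaches in most scenarios under standard evaluation metrics. In addition, the work introduces the first segment-based evaluator for recurrent text generation, which uses GPT-4 to comprehensively reason about and assess audio description generation performance.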
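A complexity-based demonstration selection step might look like the following sketch. The text does not say how complexity is measured, so this example assumes it is the number of non-empty reasoning lines in each candidate demonstration. The `reasoning_steps` and `select_demonstrations` helpers and the dictionary layout of the pool are hypothetical.

```python
def reasoning_steps(demo):
    """Assumed complexity measure: count of non-empty lines in the demo's reasoning text."""
    return sum(1 for line in demo["reasoning"].splitlines() if line.strip())


def select_demonstrations(pool, k):
    """Pick the k demonstrations with the most reasoning steps for the few-shot prompt."""
    return sorted(pool, key=reasoning_steps, reverse=True)[:k]


# Toy demonstration pool; the contents are illustrative only.
pool = [
    {"id": "a", "reasoning": "Identify the character."},
    {"id": "b", "reasoning": "Identify the character.\nRecall the earlier plot.\nDescribe the action."},
    {"id": "c", "reasoning": "Identify the character.\nDescribe the action."},
]
chosen = [demo["id"] for demo in select_demonstrations(pool, k=2)]
```

Because selection only reorders an existing pool of examples, a step like this is compatible with a training-free design: no model weights are updated, only the prompt changes.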

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning