This paper presents Audio-Visual LLM, a Multimodal Large Language Model that
takes both visual and auditory inputs for holistic video understanding. A key
design is the modality-augmented training, which involves the integration of
modality-specific tokens engineered to activate the appropriate visual and/or
auditory encoder selectively. This mechanism is pivotal in enabling end-to-end
joint training with video data at different modalities, including visual-only,
audio-only, and audio-visual formats. Moreover, we introduce a high-quality
video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual
LLM to adeptly process a variety of task-oriented video instructions, ranging
from multi-turn conversations and audio-visual narratives to complex reasoning
tasks. Extensive experiments demonstrate that Audio-Visual LLM
achieves strong zero-shot results across a range of video understanding tasks.
For example, Audio-Visual LLM achieves an accuracy of 53.7% on MSRVTT-QA,
outperforming the non-LLM-based InterVideo by 6.6% and the LLM-based Valley by 4.4%.
Audio-Visual LLM also achieves competitive
performance on audio tasks (e.g., AudioCaps).
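
The selective activation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token strings `<visual>` and `<audio>`, the function `encode_sample`, and the toy encoders are all hypothetical names chosen for illustration, and the abstract does not specify how the real tokens or encoders are defined. The sketch only shows the routing idea, where an encoder runs only if its modality token appears in the prompt.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical modality tokens; the abstract does not name the real ones.
VISUAL_TOKEN = "<visual>"
AUDIO_TOKEN = "<audio>"

Embedding = List[float]
Encoder = Callable[[list], List[Embedding]]


def encode_sample(
    prompt_tokens: List[str],
    frames: Optional[list],
    waveform: Optional[list],
    encoders: Dict[str, Encoder],
) -> Dict[str, List[Embedding]]:
    """Run only the encoders whose modality token appears in the prompt."""
    inputs = {VISUAL_TOKEN: frames, AUDIO_TOKEN: waveform}
    features: Dict[str, List[Embedding]] = {}
    for token, data in inputs.items():
        if token not in prompt_tokens:
            continue  # modality not requested: its encoder is never called
        if data is None:
            raise ValueError(f"{token} is in the prompt but no input was given")
        features[token] = encoders[token](data)
    return features


# Toy stand-ins for real encoders (assumptions, for illustration only).
def toy_visual_encoder(frames: list) -> List[Embedding]:
    return [[float(len(frames))]]


def toy_audio_encoder(waveform: list) -> List[Embedding]:
    return [[float(sum(waveform))]]


ENCODERS = {VISUAL_TOKEN: toy_visual_encoder, AUDIO_TOKEN: toy_audio_encoder}

# Visual-only sample: the audio encoder is skipped entirely.
visual_only = encode_sample(
    [VISUAL_TOKEN, "Describe", "the", "clip"],
    frames=[0, 1, 2], waveform=None, encoders=ENCODERS,
)

# Audio-visual sample: both encoders run.
audio_visual = encode_sample(
    [VISUAL_TOKEN, AUDIO_TOKEN, "What", "is", "happening?"],
    frames=[0, 1], waveform=[0.5, 0.5], encoders=ENCODERS,
)
```

Because the routing depends only on the tokens in each sample, visual-only, audio-only, and audio-visual samples could in principle be mixed in one training run, which is the property the abstract attributes to modality-augmented training.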

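This paper introduces Audio-Visual LLM, a multimodal large language model that takes both visual and auditory inputs for holistic video understanding. Its key design is modality-augmented training, which integrates specially designed modality-specific tokens to selectively activate the appropriate visual and/or auditory encoder. This mechanism is essential for end-to-end joint training on multimodal video data. Experiments show that Audio-Visual LLM achieves impressive zero-shot results across a variety of video understanding tasks.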

Audio-Visual LLM for Video Understanding

Retrieval augmentation, which enhances downstream models with a knowledge
retriever and an external corpus rather than merely increasing the number of
model parameters, has been successfully applied to many natural language
processing (NLP) tasks such as text classification and question answering.
However, existing methods train the retriever and the downstream model
separately or asynchronously, mainly because of the non-differentiability
between the two parts, which usually leads to degraded performance compared to
end-to-end joint training.

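Retrieval augmentation improves downstream models with a knowledge retriever and an external corpus, and it has been successfully applied to many natural language processing (NLP) tasks. However, because of the non-differentiability between the two parts, existing methods train the retriever and the downstream model separately or asynchronously, which usually leads to degraded performance compared to end-to-end joint training.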