This paper presents Audio-Visual LLM, a Multimodal Large Language Model that
takes both visual and auditory inputs for holistic video understanding. A key
design is modality-augmented training, which adds modality-specific tokens
that selectively activate the appropriate visual and/or auditory encoder. This
mechanism is pivotal for end-to-end joint training on video data in different
modalities, including visual-only, audio-only, and audio-visual formats.
Moreover, we introduce a high-quality
video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual
LLM to adeptly process a variety of task-oriented video instructions, ranging
from multi-turn conversations and audio-visual narratives to complex reasoning
tasks. Extensive experiments demonstrate that Audio-Visual LLM achieves strong
zero-shot results across a range of video understanding tasks. For example, it
reaches an accuracy of 53.7% on MSRVTT-QA, outperforming the non-LLM-based
InternVideo by 6.6% and the LLM-based Valley by 4.4%. Audio-Visual LLM also
achieves competitive performance on audio tasks (e.g., AudioCaps).
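
The selective activation described above can be illustrated with a minimal sketch. It is not the paper's implementation: the token strings `<visual>` and `<audio>`, the toy encoders, and the function `route_modalities` are all hypothetical names chosen for illustration, and the real model would use learned encoders and projections rather than these stand-ins. The sketch only shows the routing idea, namely that an encoder runs only when its modality token appears in the prompt, so visual-only, audio-only, and audio-visual samples can share one training loop.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical token strings; the abstract does not name the actual tokens.
VISUAL_TOKEN = "<visual>"
AUDIO_TOKEN = "<audio>"

Features = List[List[float]]
Encoder = Callable[[Sequence[float]], Features]


def toy_visual_encoder(frames: Sequence[float]) -> Features:
    """Stand-in for a visual encoder: one 2-d feature per frame value."""
    return [[x, 1.0] for x in frames]


def toy_audio_encoder(samples: Sequence[float]) -> Features:
    """Stand-in for an audio encoder: one 2-d feature per sample value."""
    return [[x, -1.0] for x in samples]


def route_modalities(
    prompt_tokens: Sequence[str],
    inputs: Dict[str, Optional[Sequence[float]]],
) -> Dict[str, Features]:
    """Run only the encoders whose modality token appears in the prompt."""
    encoders: Dict[str, Tuple[str, Encoder]] = {
        VISUAL_TOKEN: ("visual", toy_visual_encoder),
        AUDIO_TOKEN: ("audio", toy_audio_encoder),
    }
    encoded: Dict[str, Features] = {}
    for token in prompt_tokens:
        if token in encoders:
            name, encoder = encoders[token]
            data = inputs.get(name)
            if data is None:
                raise ValueError(f"{token} present but no {name} input given")
            encoded[name] = encoder(data)
    return encoded


# The same sample can be used as visual-only or audio-visual,
# depending solely on which modality tokens the prompt contains.
sample = {"visual": [0.1, 0.2], "audio": [0.5]}
visual_only = route_modalities([VISUAL_TOKEN, "Describe", "the", "clip"], sample)
both = route_modalities([VISUAL_TOKEN, AUDIO_TOKEN, "What", "is", "heard?"], sample)
```

In this sketch the audio encoder is never invoked for the visual-only prompt, even though audio data is available in the sample.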

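This paper introduces Audio-Visual LLM, a multimodal large language model that takes both visual and auditory inputs for holistic video understanding. Its key design is modality-augmented training, which integrates specially designed modality-specific tokens to selectively activate the appropriate visual and/or auditory encoder. This mechanism is essential for end-to-end joint training on multimodal video data. Experiments show that Audio-Visual LLM achieves impressive zero-shot results across a variety of video understanding tasks.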