Audio-visual large language models (LLM) have drawn significant attention,
yet the fine-grained combination of both input streams is rather
under-explored, which is challenging but necessary for LLMs to understand
general video inputs. To this end, a fine-grained audio-visual joint
representation (FAVOR) learning framework for multimodal LLMs is proposed in
this paper, which extends a text-based LLM to simultaneously perceive speech
and audio events in the audio input stream and images or videos in the visual
input stream, at the frame level. To fuse the audio and visual feature streams
into joint representations and to align the joint space with the LLM input
embedding space, we propose a causal Q-Former structure with a causal attention
module to enhance the capture of causal relations of the audio-visual frames
across time. An audio-visual evaluation benchmark (AVEB) is also proposed which
comprises six representative single-modal tasks with five cross-modal tasks
reflecting audio-visual co-reasoning abilities. While achieving competitive
single-modal performance on audio, speech and image tasks in AVEB, FAVOR
achieved over 20% accuracy improvements on the video question-answering task
when fine-grained information or temporal causal reasoning is required. FAVOR,
in addition, demonstrated remarkable video comprehension and reasoning
abilities on tasks that are unprecedented by other multimodal LLMs. An
interactive demo of FAVOR is available at
this https URL, and the training code and model
checkpoints will be released upon acceptance.

通过提出细粒度的音视频联合表示学习框架 (FAVOR)，同时感知音频和视觉输入流中的语音、音频事件以及图像或视频，利用因果关注模块增强音视频帧之间的因果关系捕捉，在音频、语音和图像任务上取得了有竞争力的单模态性能，并在需要细粒度信息或时间因果推理的视频问答任务上实现了超过 20% 的准确度改进，表现出了出色的视频理解和推理能力。