In this paper, we present VideoLLaMA 2, a set of Video Large Language
Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio
understanding in video and audio-oriented tasks. Building upon its predecessor,
VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC)
connector, which effectively captures the intricate spatial and temporal
dynamics of video data. Additionally, we integrate an Audio Branch into the
model through joint training, thereby enriching the multimodal understanding
capabilities of the model by seamlessly incorporating audio cues. Comprehensive
evaluations on multiple-choice video question answering (MC-VQA), open-ended
video question answering (OE-VQA), and video captioning (VC) tasks demonstrate
that VideoLLaMA 2 consistently achieves competitive results among open-source
models and even approaches some proprietary models on several benchmarks.
Furthermore, VideoLLaMA 2 exhibits reasonable improvements on audio-only and
audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models.
These advancements underline VideoLLaMA 2's superior performance in multimodal
comprehension, setting a new standard for intelligent video analysis systems.
All models are publicly available to facilitate further research.

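This paper introduces VideoLLaMA 2, a video large language model that strengthens spatial-temporal modeling and audio understanding in video and audio tasks by incorporating a Spatial-Temporal Convolution (STC) connector and a jointly trained audio branch. It achieves competitive results on multiple tasks, further improving multimodal understanding and setting a new standard for intelligent video analysis systems.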

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

The great success of Large Language Models (LLMs) has expanded the potential
of multimodality, contributing to the gradual evolution of Artificial General
Intelligence (AGI). A true AGI agent should not only possess the capability to
perform predefined multi-tasks but also exhibit emergent abilities in an
open-world context. However, despite the considerable advancements made by
recent multimodal LLMs, they still fall short in effectively unifying
comprehension and generation tasks, let alone open-world emergent abilities. We
contend that the key to overcoming the present impasse lies in enabling text
and images to be represented and processed interchangeably within a unified
autoregressive Transformer. To this end, we introduce SEED, an elaborate image
tokenizer that empowers LLMs with the ability to SEE and Draw at the same time.
We identify two crucial design principles: (1) Image tokens should be
independent of 2D physical patch positions and instead be produced with a 1D
causal dependency, exhibiting intrinsic interdependence that aligns with the
left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens
should capture high-level semantics consistent with the degree of semantic
abstraction in words, and be optimized for both discriminativeness and
reconstruction during the tokenizer training phase. With SEED tokens, an LLM is
able to perform scalable multimodal autoregression under its original training
recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by
large-scale pretraining and instruction tuning on the interleaved textual and
visual data, demonstrating impressive performance on a broad range of
multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has
exhibited compositional emergent abilities such as multi-turn in-context
multimodal generation, acting like your AI assistant.
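
The first design principle (image tokens with a 1D causal dependency, matching left-to-right autoregressive prediction) can be illustrated with a minimal sketch. This is not the SEED implementation: the `<img>` and `</img>` boundary markers, the integer image-token IDs, and both helper functions are hypothetical names chosen for illustration. The sketch only shows how an interleaved text-and-image token sequence can be treated as one stream under a causal mask and a next-token objective.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i may attend to j.

    Under a 1D causal dependency, each token sees only itself and earlier tokens.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(tokens):
    """Return (context, target) pairs for left-to-right next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical interleaved sequence: text tokens followed by discrete image-token IDs.
# The boundary markers and the ID values are assumptions, not taken from the paper.
sequence = ["a", "cat", "<img>", 17, 402, 9, "</img>"]

mask = causal_mask(len(sequence))
pairs = next_token_pairs(sequence)

# The image tokens are predicted one by one, exactly like words:
# the context for the second image token includes the text and the first image token.
context, target = pairs[3]
```

Because the image tokens sit in the same 1D stream as the words, the same objective covers both comprehension (text predicted after image tokens) and generation (image tokens predicted after text), which is the unification the abstract argues for.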

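By introducing the SEED image tokenizer, LLMs can perform scalable multimodal autoregression under their original training recipe and demonstrate impressive performance on a broad range of multimodal comprehension and generation tasks.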

Making LLaMA SEE and Draw with SEED Tokenizer

Understanding and reasoning about cooking recipes is a fruitful research
direction towards enabling machines to interpret procedural text. In this work,
we introduce RecipeQA, a dataset for multimodal comprehension of cooking
recipes. It comprises approximately 20K instructional recipes with multiple
modalities such as titles, descriptions, and aligned sets of images. With over
36K automatically generated question-answer pairs, we design a set of
comprehension and reasoning tasks that require joint understanding of images
and text, capturing the temporal flow of events and making sense of procedural
knowledge. Our preliminary results indicate that RecipeQA will serve as a
challenging test bed and an ideal benchmark for evaluating machine
comprehension systems. The data and leaderboard are available at
this http URL

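This paper introduces RecipeQA, a dataset for multimodal comprehension and reasoning tasks. It contains approximately 20,000 instructional cooking recipes with multiple modalities (such as titles, descriptions, and aligned sets of images), along with more than 36,000 corresponding question-answer pairs. Using the automatically generated questions, we design a set of tasks that require joint understanding of images and text, capturing the temporal flow of events, and making sense of procedural knowledge. The dataset is an ideal benchmark for evaluating machine comprehension systems, and the data and a leaderboard are provided.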