Despite the rapid development of video Large Language Models (LLMs), a
comprehensive evaluation is still absent. In this paper, we introduce a unified
evaluation that encompasses multiple video tasks, including captioning,
question and answering, retrieval, and action recognition. In addition to
conventional metrics, we showcase how GPT-based evaluation can match human-like
performance in assessing response quality across multiple aspects. We propose a
simple baseline: Video-LLaVA, which uses a single linear projection and
outperforms existing video LLMs. Finally, we evaluate video LLMs beyond
academic datasets, which show encouraging recognition and reasoning
capabilities in driving scenarios with only hundreds of video-instruction pairs
for fine-tuning. We hope our work can serve as a unified evaluation for video
LLMs, and help expand more practical scenarios. The evaluation code will be
available soon.

本文提出了一个统一的评估方法，包括字幕、问答、检索和行动识别等多个视频任务，展示了基于 GPT 的评估方法在多个方面可以与人类一样的表现，同时也展示了一种简单的基准方法 Video-LLaVA，在评估视频 LLMs 时优于现有方法。此外，我们还在实际驾驶场景中评估了视频 LLMs 的有效性，并展示了令人鼓舞的识别和推理能力。希望我们的工作能为视频 LLMs 提供一个统一的评估方法，并帮助扩展更多实际应用场景。

VLM-Eval: 视频大型语言模型的通用评估

VLM-Eval: A General Evaluation on Video Large Language Models

Recent advances in Natural Language Processing, and in particular on the
construction of very large pre-trained language representation models, is
opening up new perspectives on the construction of conversational information
seeking (CIS) systems. In this paper we investigate the usage of in-context
learning and pre-trained language representation models to address the problem
of information extraction from process description documents, in an incremental
question and answering oriented fashion. In particular we investigate the usage
of the native GPT-3 (Generative Pre-trained Transformer 3) model, together with
two in-context learning customizations that inject conceptual definitions and a
limited number of samples in a few shot-learning fashion. The results highlight
the potential of the approach and the usefulness of the in-context learning
customizations, which can substantially contribute to address the "training
data challenge" of deep learning based NLP techniques the BPM field. It also
highlight the challenge posed by control flow relations for which further
training needs to be devised.

本论文研究了利用上下文学习和预训练语言表示模型来解决过程描述文档中信息提取的问题，与原生 GPT-3 模型一起，通过注入概念定义和少量样本进行了两个上下文学习操作。研究结果显示了该方法的潜力和上下文学习的有用性，同时也指出了控制流关系所带来的挑战。