Large language models (LLMs) famously exhibit emergent in-context learning
(ICL) -- the ability to rapidly adapt to new tasks using few-shot examples
provided as a prompt, without updating the model's weights. Built on top of
LLMs, vision large language models (VLLMs) have advanced significantly in areas
such as recognition, reasoning, and grounding. However, investigations into
multimodal ICL have predominantly focused on few-shot visual question
answering (VQA) and image captioning, which we will show neither exploit the
strengths of ICL nor test its limitations. The broader capabilities and
limitations of multimodal ICL remain under-explored. In this study, we
introduce VL-ICL Bench, a comprehensive benchmark for multimodal in-context
learning. It encompasses a broad spectrum of tasks that involve both images and
text as inputs and outputs, and poses different types of challenges, from
perception to reasoning and long context length. We evaluate the abilities of
state-of-the-art VLLMs against this benchmark suite, revealing their diverse
strengths and weaknesses, and showing that even the most advanced models, such
as GPT-4, find the tasks challenging. By highlighting a range of new ICL tasks,
and the associated strengths and limitations of existing models, we hope that
our dataset will inspire future work on enhancing the in-context learning
capabilities of VLLMs, as well as inspire new applications that leverage VLLM
ICL. The code and dataset are available at this https URL
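
To make the few-shot setup concrete, the following is a minimal sketch of how an interleaved image-text ICL prompt might be assembled. It is an illustration only, not the benchmark's own code: the `Shot` type, the `build_icl_prompt` helper, and the `<image:...>` placeholder syntax are all hypothetical, and real VLLMs each define their own image-token format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One in-context demonstration: an image reference plus its text pair."""
    image: str     # hypothetical image path or identifier
    question: str
    answer: str


def build_icl_prompt(shots: List[Shot], query_image: str, query_question: str) -> str:
    """Interleave support examples with the query, leaving the final answer open.

    The "<image:...>" marker is an assumed placeholder, not a real model's
    image-token syntax.
    """
    parts = []
    for shot in shots:
        parts.append(f"<image:{shot.image}>\nQ: {shot.question}\nA: {shot.answer}")
    # The query repeats the pattern but stops at "A:" so the model completes it.
    parts.append(f"<image:{query_image}>\nQ: {query_question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    support = [
        Shot("img_001.png", "What is this object called?", "dax"),
        Shot("img_002.png", "What is this object called?", "blicket"),
    ]
    print(build_icl_prompt(support, "img_003.png", "What is this object called?"))
```

No weights are updated at any point: the model adapts only through the demonstrations placed ahead of the query, so adding shots lengthens the context, which is one of the challenges the abstract mentions.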

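This study introduces VL-ICL Bench, a comprehensive benchmark for multimodal in-context learning, and evaluates state-of-the-art vision large language models on the benchmark suite, revealing their diverse strengths and weaknesses and showing that even the most advanced models, such as GPT-4, find these tasks challenging.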

VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning

The integration of visual inputs with large language models (LLMs) has led to
remarkable advancements in multi-modal capabilities, giving rise to visual
large language models (VLLMs). However, effectively harnessing VLLMs for
intricate visual perception tasks remains a challenge. In this paper, we
present a novel end-to-end framework named PerceptionGPT, which efficiently and
effectively equips the VLLMs with visual perception abilities by leveraging the
representation power of LLMs' token embedding. Our proposed method treats the
token embedding of the LLM as the carrier of spatial information, then leverages
lightweight visual task encoders and decoders to perform visual perception
tasks (e.g., detection, segmentation). Our approach significantly alleviates
the training difficulty suffered by previous approaches that formulate the
visual outputs as discrete tokens, and achieves superior performance
with fewer trainable parameters, less training data, and shorter training time.
Moreover, as only one token embedding is required to decode the visual outputs,
the resulting sequence length during inference is significantly reduced.
Consequently, our approach enables accurate and flexible representations,
seamless integration of visual perception tasks, and efficient handling of
multiple visual outputs. We validate the effectiveness and efficiency of our
approach through extensive experiments. The results demonstrate significant
improvements over previous methods with far fewer trainable parameters and GPU
hours, which facilitates future research on equipping LLMs with visual
perception abilities.
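
The idea of carrying a visual output in a single embedding can be illustrated with a toy sketch. This is not PerceptionGPT's architecture: the abstract describes learned lightweight encoders and decoders, whereas the sketch below substitutes a fixed linear map and its transpose so that the round trip is exact. The names `EMBED_DIM`, `encode_box`, and `decode_box` are hypothetical.

```python
import math
from typing import List

EMBED_DIM = 8  # hypothetical toy size; a real LLM embedding is far larger


def encode_box(box: List[float]) -> List[float]:
    """Toy encoder: map a normalized box (x1, y1, x2, y2) to one embedding vector.

    A fixed linear map stands in for a learned encoder (an assumption of this
    sketch). Each coordinate is copied into two slots and scaled by 1/sqrt(2),
    so the map has orthonormal columns.
    """
    scale = 1.0 / math.sqrt(2.0)
    return [box[i % 4] * scale for i in range(EMBED_DIM)]


def decode_box(embedding: List[float]) -> List[float]:
    """Toy decoder: the transpose of the encoder, recovering the four coordinates."""
    scale = 1.0 / math.sqrt(2.0)
    return [(embedding[i] + embedding[i + 4]) * scale for i in range(4)]


if __name__ == "__main__":
    box = [0.10, 0.20, 0.60, 0.90]
    token_embedding = encode_box(box)      # one vector carries the whole box
    recovered = decode_box(token_embedding)
    print(len(token_embedding), recovered)
```

In this sketch the whole box occupies one embedding position, which mirrors the abstract's point that a single token embedding suffices to decode a visual output instead of a sequence of discrete coordinate tokens.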

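This paper proposes PerceptionGPT, a novel end-to-end framework that efficiently and effectively equips VLLMs with visual perception abilities by leveraging the representation power of the LLM's token embeddings. The method treats the LLM's token embedding as the carrier of spatial information and uses lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). This alleviates the training difficulty of earlier approaches that discretize visual outputs into tokens, and it achieves superior performance with fewer trainable parameters, less training data, and shorter training time. Moreover, because only one token embedding is needed to decode the visual outputs during inference, the resulting sequence length is greatly reduced. The method therefore enables accurate and flexible representations, seamless integration of visual perception tasks, and efficient handling of multiple visual outputs. Extensive experiments confirm its effectiveness and efficiency, showing significant improvements with fewer trainable parameters and GPU hours, which facilitates future research on equipping LLMs with visual perception abilities.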