In-context learning (ICL) on large language models (LLMs) has recently received great attention, and the technique also applies to vision-language models (VLMs) built upon LLMs. These VLMs can respond to queries by conditioning their responses on a series of multimodal demonstrations, which comprise images, queries, and answers. Although ICL has been studied extensively on LLMs, research on ICL in VLMs remains limited. The additional visual information in the demonstrations motivates two research questions: which of the two modalities in the demonstrations matters more, and how can we select effective multimodal demonstrations to enhance ICL performance? This study investigates the significance of both visual and language information. Our findings indicate that ICL in VLMs is predominantly driven by the textual information in the demonstrations, whereas the visual information barely affects ICL performance. We then explain these findings by analyzing the model's information flow and comparing its internal states under different ICL settings. Motivated by this analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demonstrations and achieves better ICL performance. Extensive experiments support our findings, our explanation of them, and the resulting improvement in the ICL performance of VLMs.

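By studying in-context learning (ICL) on vision-language models (VLMs) built upon large language models (LLMs), this study finds that ICL in VLMs is driven mainly by the textual information in the demonstrations, while visual information has little effect on ICL performance. In light of this finding, and by analyzing the model's information flow and internal states under different ICL settings, we propose a simple yet effective method, MMICES (Mixed Modality In-Context Example Selection), which considers both the visual and language modalities when selecting demonstrations and shows better ICL performance. Extensive experiments confirm our findings and support our understanding and improvement of the ICL performance of VLMs.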
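The abstract states only that MMICES considers both visual and language modalities when selecting demonstrations; it does not say how the two are combined. As a minimal sketch of one plausible reading, the Python below first pre-filters candidates by image similarity and then re-ranks the survivors by text similarity. The two-stage order, the use of cosine similarity, the function name `select_mixed_modality`, and the parameters `k_pre` and `n_shots` are all assumptions made for illustration, not details taken from the source. Embeddings are assumed to be precomputed, non-zero vectors from some image and text encoder.

```python
import numpy as np


def _cosine(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each candidate row.

    Assumes no vector has zero norm.
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q


def select_mixed_modality(
    query_img: np.ndarray,
    query_txt: np.ndarray,
    cand_img: np.ndarray,
    cand_txt: np.ndarray,
    k_pre: int,
    n_shots: int,
) -> list[int]:
    """Return indices of n_shots demonstrations chosen using both modalities.

    Hypothetical two-stage scheme (an assumption, not the paper's stated method):
      1. keep the k_pre candidates whose images are most similar to the query image;
      2. among those, keep the n_shots whose text is most similar to the query text.
    """
    # Stage 1: visual pre-filter over the whole candidate pool.
    img_sim = _cosine(query_img, cand_img)
    pre = np.argsort(-img_sim)[:k_pre]

    # Stage 2: textual re-ranking restricted to the pre-filtered candidates.
    txt_sim = _cosine(query_txt, cand_txt[pre])
    order = np.argsort(-txt_sim)[:n_shots]

    # Map positions within the pre-filtered set back to original indices.
    return pre[order].tolist()
```

In this sketch a candidate with a perfectly matching text can still be dropped if its image is too dissimilar to survive the first stage, which is one way of letting both modalities influence the choice; a weighted sum of the two similarity scores would be an equally reasonable alternative under the same assumptions.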

Understanding and Improving In-Context Learning in Vision-Language Models