Starting from the resurgence of deep learning, vision-language models (VLMs) benefiting from large language models (LLMs) have never been so popular. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images. The issue can traced back to the architectural design of VLMs or pre-training data. Specifically, the current VLMs primarily emphasize utilizing multi-modal data with a single image some, rather than multi-modal prompts with interleaved multiple images and text. Even though some newly proposed VLMs could handle user prompts with multiple images, pre-training data does not provide more sophisticated multi-modal prompts than interleaved image and text crawled from the web. We propose MMICL to address the issue by considering both the model and data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner and MIC dataset to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new stat-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially for complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

通过考虑模型和数据的角度，提出了MMICL去解决图像与文本交叉多模态提示的问题，通过无需训练的数据更好地适应用户真实应用中复杂的提示，其中包括多模态上下文与交叉的图像和文本、每个图像的文本参考以及具有空间、逻辑或时间关系的多图像数据。在广泛的视觉-语言任务中，特别是在复杂推理基准测试中，MMICL取得了新的最先进的零样本和少样本性能。同时，对ScienceQA-IMG上的实验表明MMICL成功缓解了视觉-语言模型中的语言偏差问题，我们相信这是MMICL卓越性能背后的原因。

MMICL: 视觉语言模型的多模态上下文学习