Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code is available at https://gitlab.com/folbaeni/multimodal-icl
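
To make finding (2) concrete, the following is a minimal sketch of a majority-voting baseline over retrieved context examples. It is not the authors' code. It assumes that demonstrations are retrieved by cosine similarity over precomputed embeddings, and that the prediction is simply the most frequent answer among the retrieved demonstrations, with no model call at all. The function names (`cosine`, `retrieve_top_k`, `majority_vote`) and the toy two-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_emb, pool, k):
    """Return the k (embedding, answer) pairs most similar to the query.

    Assumption: similarity is cosine similarity over precomputed embeddings.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return ranked[:k]


def majority_vote(demos):
    """Predict the most frequent answer among the retrieved demonstrations."""
    counts = Counter(answer for _, answer in demos)
    return counts.most_common(1)[0][0]


# Toy support set: hypothetical 2-D embeddings paired with answers.
pool = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.3], "dog"),
]
query = [1.0, 0.05]

demos = retrieve_top_k(query, pool, k=3)
prediction = majority_vote(demos)
```

A baseline of this kind never looks at the query beyond retrieval, so a model that does no better than it is plausibly copying answers from its context rather than reasoning about the query image.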

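By studying multimodal ICL in large multimodal models, we find that M-ICL relies primarily on text-driven mechanisms and is almost unaffected by the image modality. When combined with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over the context examples. We also identify several biases and limitations of M-ICL that are worth considering before deployment.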

What is the key to multimodal in-context learning?