The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have primarily concentrated on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties that MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies such as fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at \url{https://github.com/UW-Madison-Lee-Lab/CoBSAT}.

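Large language models have evolved from text-only systems into Multimodal Large Language Models (MLLMs), and In-Context Learning (ICL) has been extended to multimodal settings. This work proposes CoBSAT, a new benchmark dataset for the T2I-ICL task. A comparison across six state-of-the-art MLLMs shows the difficulties and challenges of T2I-ICL, and strategies such as fine-tuning and Chain-of-Thought prompting are explored to achieve notable improvements.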

Can multimodal large language models perform text-to-image in-context learning?