This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available to foster further research in this direction.

该论文提出了Visual CoT，一种利用多模态大型语言模型（MLLMs）的推理能力的新型流程，通过结合可解释性认知链条（CoT）推理来处理复杂的视觉输入，并提供可解释的思路。我们收集并引入了Visual CoT数据集，该数据集包含373k个问题-答案对，通过中间边界框突出显示回答问题所必要的关键区域，能够评估在需要特定局部区域识别的场景中的MLLMs的性能。大量实验证明了我们的框架的有效性，并为更好的推理策略提供了启示。Visual CoT数据集、基准和预训练模型可用于促进相关方向的进一步研究。

视觉CoT：在多模态语言模型中释放连续思维推理