Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.

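This work addresses the performance decline of vision-language models when handling complex multi-image inputs. We propose a novel Focus-Centric Visual Chain paradigm and, using the Focus-Centric Data Synthesis approach to generate high-quality data, construct VISC-150K, a large-scale dataset designed specifically for multi-image tasks. Experimental results show that the method improves performance by an average of 3.16% and 2.24% across different model architectures, advancing the capabilities of vision-language systems in complex visual scenarios.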

Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chain