Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate that PRIMA outperforms state-of-the-art baselines.

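This work addresses the limitation that existing pixel-grounding models operate only in single-image settings, and it fills the gap left by multi-image understanding models that lack pixel-level grounding. We propose a new task, multi-image pixel-grounded reasoning segmentation, and introduce PRIMA, a model that combines pixel-level grounding with strong multi-image reasoning capabilities to generate contextually rich, pixel-grounded explanations. Experimental results show that PRIMA outperforms current state-of-the-art baselines.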

PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation