Instruction following is crucial in contemporary LLMs. However, when extended
to a multimodal setting, it often suffers from misalignment between a specific
textual instruction and the targeted local region of an image. To achieve more
accurate and nuanced multimodal instruction following, we introduce
Instruction-guided Visual Masking (IVM), a new versatile visual grounding model
that is compatible with diverse multimodal models, such as LMMs and robot models.
By constructing visual masks for instruction-irrelevant regions, IVM-enhanced
multimodal models can effectively focus on task-relevant image regions to
better align with complex instructions. Specifically, we design a visual
masking data generation pipeline and create an IVM-Mix-1M dataset with 1
million image-instruction pairs. We further introduce a new learning technique,
Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training
that prioritizes high-quality data samples. Experimental results on generic
multimodal tasks such as VQA and embodied robotic control demonstrate the
versatility of IVM, which, as a plug-and-play tool, significantly boosts the
performance of diverse multimodal models, yielding new state-of-the-art results
across challenging multimodal benchmarks. Code is available at
this https URL
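
The idea of prioritizing high-quality samples during supervised training can be
illustrated with a weighted loss. The sketch below is not the DWSL objective
from the paper, whose exact form the abstract does not state. It assumes only
that a discriminator assigns each sample a quality score in [0, 1] and that the
per-sample losses are averaged with those scores as weights. The function name
and both parameter names are hypothetical.

```python
def discriminator_weighted_loss(per_sample_losses, discriminator_scores):
    """Average per-sample losses, weighting each by its discriminator score.

    Assumption: scores lie in [0, 1], with higher meaning higher data quality.
    This normalised weighted mean is an illustrative choice, not the paper's
    stated formula.
    """
    if len(per_sample_losses) != len(discriminator_scores):
        raise ValueError("losses and scores must have the same length")
    total_weight = sum(discriminator_scores)
    if total_weight == 0:
        # No sample is trusted at all, so nothing contributes to the loss.
        return 0.0
    weighted_sum = sum(
        score * loss
        for score, loss in zip(discriminator_scores, per_sample_losses)
    )
    return weighted_sum / total_weight
```

With equal scores this reduces to the ordinary mean loss; a sample scored 0
contributes nothing, which is one simple way to down-weight low-quality data.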

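By introducing Instruction-guided Visual Masking (IVM) to improve multimodal instruction following, this work demonstrates the applicability of IVM in multimodal settings and shows its advantage in accurately aligning images with instructions. By constructing visual masks, IVM-enhanced multimodal models can better focus on task-relevant image regions and thus follow instructions more closely. Experimental results show that IVM, as a plug-and-play tool, significantly improves the performance of diverse multimodal models and achieves new state-of-the-art results on a range of challenging multimodal benchmarks.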
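
Masking instruction-irrelevant regions before an image reaches a downstream
model can be sketched as a per-pixel blend. This is a minimal illustration, not
the IVM implementation: it assumes a grayscale image stored as nested lists and
a relevance mask in [0, 1] that would come from a grounding model but is
supplied by hand here. The name `apply_visual_mask` and the `fill` parameter
are hypothetical.

```python
def apply_visual_mask(image, mask, fill=0.0):
    """Keep relevant pixels and blend irrelevant ones toward `fill`.

    image: rows of pixel intensities (nested lists of floats).
    mask:  same shape; 1.0 means task-relevant, 0.0 means irrelevant.
    """
    same_shape = len(image) == len(mask) and all(
        len(row) == len(mask_row) for row, mask_row in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("image and mask must have the same shape")
    return [
        [m * pixel + (1.0 - m) * fill for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


image = [[10.0, 20.0], [30.0, 40.0]]
mask = [[1.0, 0.0], [0.0, 1.0]]  # hand-made stand-in for a predicted mask
masked_image = apply_visual_mask(image, mask)
```

Because the masking happens on the input image, the downstream model itself
needs no changes, which is consistent with the plug-and-play use described
above.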

Instruction-Guided Visual Masking

Full-reference image quality metrics (FR-IQMs) aim to measure the visual
differences between a pair of reference and distorted images, with the goal of
accurately predicting human judgments. However, existing FR-IQMs, including
traditional ones like PSNR and SSIM and even perceptual ones such as HDR-VDP,
LPIPS, and DISTS, still fall short in capturing the complexities and nuances of
human perception. In this work, rather than devising a novel IQM model, we seek
to improve upon the perceptual quality of existing FR-IQM methods. We achieve
this by considering visual masking, an important characteristic of the human
visual system that changes its sensitivity to distortions as a function of
local image content. Specifically, for a given FR-IQM, we propose to
predict a visual masking model that modulates reference and distorted images in
a way that penalizes the visual errors based on their visibility. Since the
ground truth visual masks are difficult to obtain, we demonstrate how they can
be derived in a self-supervised manner solely based on mean opinion scores
(MOS) collected from an FR-IQM dataset. Our approach results in enhanced
FR-IQMs that are more in line with human judgments both visually and
quantitatively.

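By introducing the concept of visual masking and building on existing FR-IQM models, this work proposes an image quality assessment approach that captures human perception more accurately. It also proposes a self-supervised method for learning the visual masking model, which in turn yields better image quality predictions.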
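
Modulating the reference and distorted images with a visual mask before
applying an existing metric can be sketched as follows. This is an illustration
under stated assumptions rather than the authors' method: the mask is applied
by element-wise multiplication, plain MSE stands in for the underlying FR-IQM,
and the mask is given by hand instead of being predicted by a learned model.
The names `mse`, `modulate`, and `masked_mse` are hypothetical.

```python
def mse(image_a, image_b):
    """Mean squared error between two same-shaped nested-list images."""
    pixel_count = sum(len(row) for row in image_a)
    squared_error = sum(
        (a - b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )
    return squared_error / pixel_count


def modulate(image, mask):
    """Scale each pixel by its mask value (assumed element-wise product)."""
    return [
        [m * pixel for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def masked_mse(reference, distorted, mask):
    """Apply the same mask to both images, then compute the base metric."""
    return mse(modulate(reference, mask), modulate(distorted, mask))


reference = [[0.0, 0.0]]
distorted = [[2.0, 2.0]]
all_visible = [[1.0, 1.0]]
half_hidden = [[1.0, 0.0]]  # error in the second pixel is treated as invisible
```

With a mask of all ones the result equals the unmasked metric; lowering the
mask where distortions are hard to see reduces their contribution to the score.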