Integration of Large Language Models (LLMs) into visual domain tasks,
resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in
vision-language tasks, particularly for visual question answering (VQA).
However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial
reasoning and localization awareness. Despite generating highly descriptive and
elaborate textual answers, these models fail at simple tasks such as
distinguishing left from right locations. In this work, we explore how
instruction fine-tuning objectives based on image-space coordinates could inject
spatial awareness into V-LLMs. We discover optimal coordinate representations,
data-efficient instruction fine-tuning objectives, and pseudo-data generation
strategies that lead to improved spatial awareness in V-LLMs. Additionally, our
resulting model improves VQA across image and video domains, reduces undesired
hallucination, and generates better contextual object descriptions. Experiments
across 5 vision-language tasks involving 14 different datasets establish the
clear performance improvements achieved by our proposed framework.
