In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning
(ICL) remains limited by challenges in cross-modal interactions and
representation disparities. To overcome these challenges, we introduce a novel
Visual In-Context Learning (VICL) method comprising Visual Demonstration
Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented
Demonstration Composition. Our approach retrieves images via ''Retrieval &
Rerank'' paradigm, summarises images with task intent and task-specific visual
parsing, and composes language-based demonstrations that reduce token count and
alleviate cross-modal interaction problem. Experimental evaluations on five
visual reasoning datasets demonstrate the effectiveness of our method.
Moreover, our extensive experiments leverage information flow analysis to
elucidate the effectiveness of our method, and investigate the impact of length
and position of demonstrations for LVLM. The use of in-context unlearning
further shows promise in resetting specific model knowledge without retraining.

通过引入一种新颖的视觉上下文学习方法（VICL），包括视觉演示检索、目标导向图像摘要和目标导向演示组合，解决了大型视觉语言模型（LVLMs）中上下文学习的挑战，提高了效果，并且进一步调查了演示文本长度和位置对 LVLM 的影响，展示了 ICL 复位特定模型知识的潜力。