Composed Image Retrieval (CIR) is the task of retrieving images that are similar to a query image while reflecting a provided textual modification. Current techniques train CIR models with supervised learning on labeled triplets of (reference image, text, target image). Such triplets are far less common than simple image-text pairs, which limits the scalability and widespread use of CIR. Zero-shot CIR, on the other hand, can be trained relatively easily on image-caption pairs without modeling the image-to-image relation, but it tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for a reference image and its related target images, and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., the visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
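The pseudo-triplet idea can be illustrated with a minimal sketch. The abstract does not say how reference and target images are searched for or how VDG is invoked, so everything here is an assumption for illustration: images are represented by precomputed embedding vectors, related pairs are picked by cosine similarity inside a hypothetical band `[low, high]` (similar enough to be related, not so similar as to be near-duplicates), and `describe_delta` is a hypothetical stand-in for the VDG call.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_pairs(
    embeddings: List[Vector], low: float = 0.5, high: float = 0.95
) -> List[Tuple[int, int]]:
    """For each image, pick the most similar other image whose similarity
    lies in [low, high]. The band is an assumed heuristic, not from the paper."""
    pairs = []
    for i, ref in enumerate(embeddings):
        best_j, best_sim = None, -1.0
        for j, cand in enumerate(embeddings):
            if i == j:
                continue
            sim = cosine(ref, cand)
            if low <= sim <= high and sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


def build_pseudo_triplets(
    embeddings: List[Vector],
    describe_delta: Callable[[int, int], str],
    low: float = 0.5,
    high: float = 0.95,
) -> List[Tuple[int, str, int]]:
    """Turn mined (reference, target) pairs into (reference, text, target)
    triplets, using describe_delta as a placeholder for the delta generator."""
    return [
        (ref, describe_delta(ref, tgt), tgt)
        for ref, tgt in mine_pairs(embeddings, low, high)
    ]


# Toy auxiliary data: three 2-D "image embeddings".
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Hypothetical stub; a real system would call the trained generator here.
def stub_delta(ref: int, tgt: int) -> str:
    return f"change image {ref} into image {tgt}"

triplets = build_pseudo_triplets(toy_embeddings, stub_delta)
```

The resulting triplets have the same shape as the labeled triplets used in supervised CIR, which is what would let them be mixed into the training data of an existing CIR model.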

Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval