Knowledge retrieval with multi-modal queries plays a crucial role in
supporting knowledge-intensive multi-modal applications. However, existing
methods face challenges in terms of their effectiveness and training
efficiency, especially when it comes to training and integrating multiple
retrievers to handle multi-modal queries. In this paper, we propose an
innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can
effectively serve as virtual knowledge bases, even when trained with limited
data. We retrieve knowledge via a two-step process: 1) generating knowledge
clues related to the queries, and 2) obtaining the relevant document by
searching databases using the knowledge clue. In particular, we first introduce
an object-aware prefix-tuning technique to guide multi-grained visual learning.
Then, we align multi-grained visual features into the textual feature space of
the LLM, employing the LLM to capture cross-modal interactions. Subsequently,
we construct instruction data with a unified format for model training.
Finally, we propose the knowledge-guided generation strategy to impose prior
constraints in the decoding steps, thereby promoting the generation of
distinctive knowledge clues. Through experiments conducted on three benchmarks,
we demonstrate significant improvements ranging from 3.0% to 14.6% across all
evaluation metrics when compared to strong baselines.

我们提出了一种创新的端到端生成框架，用于多模态知识检索，通过利用大型语言模型 (LLMs) 作为虚拟知识库，使用对象感知的前缀调优技术来指导多粒度视觉学习，将多粒度视觉特征对齐到 LLM 的文本特征空间中，通过统一格式的指令数据构建模型训练，最后，我们提出了知识引导的生成策略，在解码步骤中施加先验约束，促进独特知识线索的生成，在三个基准测试中实验证明，与强基线方法相比，在所有评估指标上均取得了 3.0% 到 14.6% 的显著改进。

利用大型语言模型的生成式多模态知识检索

Generative Multi-Modal Knowledge Retrieval with Large Language Models

We investigate knowledge retrieval with multi-modal queries, i.e. queries
containing information split across image and text inputs, a challenging task
that differs from previous work on cross-modal retrieval. We curate a new
dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a
system to retrieve knowledge from a large corpus by integrating contents from
both text and image queries. We introduce a retriever model ``ReViz'' that can
directly process input text and images to retrieve relevant knowledge in an
end-to-end fashion without being dependent on intermediate modules such as
object detectors or caption generators. We introduce a new pretraining task
that is effective for learning knowledge retrieval with multimodal queries and
also improves performance on downstream tasks. We demonstrate superior
performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot
settings as well as further improvements when finetuned on these datasets.

本文介绍了一个新的数据集 ReMuQ，针对跨媒体检索的任务，提出了一个直接处理文本和图像输入的 Retriever 模型 `ReViz`，并引入了一个新的预训练任务，实现了对多模态查询的知识检索，并在两个数据集上取得了优秀的检索效果。

多模态查询的端到端知识检索

End-to-end Knowledge Retrieval with Multi-modal Queries

We introduce MQ-Det, an efficient architecture and pre-training strategy
design to utilize both textual description with open-set generalization and
visual exemplars with rich description granularity as category queries, namely,
Multi-modal Queried object Detection, for real-world detection with both
open-vocabulary categories and various granularity. MQ-Det incorporates vision
queries into existing well-established language-queried-only detectors. A
plug-and-play gated class-scalable perceiver module upon the frozen detector is
proposed to augment category text with class-wise visual information. To
address the learning inertia problem brought by the frozen detector, a vision
conditioned masked language prediction strategy is proposed. MQ-Det's simple
yet effective architecture and training strategy design is compatible with most
language-queried object detectors, thus yielding versatile applications.
Experimental results demonstrate that multi-modal queries largely boost
open-world detection. For instance, MQ-Det significantly improves the
state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS
benchmark and averagely +6.3% AP on 13 few-shot downstream tasks, with merely
3% pre-training time required by GLIP. Code is available at
this https URL

MQ-Det 是一种多模态查询目标检测方法，结合了文本和图像作为类别查询，该方法通过在现有的只有文本的检测器中插入可扩展的感知模块，将类别文本与类别视觉信息相结合，并提出了一种视觉条件掩码语言预测策略，可以显著提高开放式检测的性能。