Multimodal LLMs are the natural evolution of LLMs, extending their
capabilities beyond the purely textual modality. While research is
being carried out to design novel architectures and vision-and-language
adapters, in this paper we concentrate on endowing such models with the
capability of answering questions that require external knowledge. Our
approach, termed Wiki-LLaVA, aims at integrating an external knowledge source
of multimodal documents, which is accessed through a hierarchical retrieval
pipeline. With this approach, relevant passages are retrieved from the
external knowledge source and employed as additional context for the LLM,
improving the effectiveness and precision of generated dialogues. We conduct
extensive experiments on datasets tailored for visual question answering with
external data and demonstrate the appropriateness of our approach.

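We propose a method called Wiki-LLaVA that, through a hierarchical retrieval pipeline, integrates an external knowledge source of multimodal documents into the LLM as additional context, thereby improving the effectiveness and precision of the generated dialogues. We conduct extensive experiments on visual question answering datasets with external data and demonstrate the suitability of the approach.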
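
As a rough illustration of what a hierarchical retrieval pipeline of this kind might look like, the following Python sketch first selects the most relevant documents and then ranks the passages inside them, before prepending the selected passages to the question as context. It is only a toy sketch under stated assumptions: the `Document` class, the function names, the use of cosine similarity, the choice of an image-derived query for the document stage and a text-derived query for the passage stage, and the prompt layout are all hypothetical and are not taken from the text above.

```python
import math
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical knowledge-base entry; embeddings are toy vectors."""
    title: str
    doc_embedding: list            # assumed document-level embedding
    passages: list                 # text passages of the document
    passage_embeddings: list       # assumed: one embedding per passage


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_retrieve(image_query, text_query, documents,
                          top_docs=1, top_passages=2):
    """Two-stage retrieval: documents first, then passages within them."""
    # Stage 1 (assumption): rank documents against an image-derived query.
    ranked_docs = sorted(
        documents,
        key=lambda d: cosine(image_query, d.doc_embedding),
        reverse=True,
    )[:top_docs]

    # Stage 2 (assumption): rank passages of the kept documents
    # against a question-derived query.
    scored = [
        (cosine(text_query, emb), passage)
        for doc in ranked_docs
        for passage, emb in zip(doc.passages, doc.passage_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_passages]]


def build_prompt(passages, question):
    """Prepend retrieved passages to the question (hypothetical layout)."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In practice the toy vectors would be replaced by embeddings from real encoders, and the linear scan by an approximate nearest-neighbour index; neither choice is specified here.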