Recently, Vision Language Models (VLMs) have gained significant attention, showing notable advances across various tasks by leveraging extensive image-text paired data. However, prevailing VLMs often treat Visual Question Answering (VQA) as a perception task, employing black-box models that overlook explicit modeling of the relationships between different questions within the same visual scene. Moreover, existing VQA methods that rely on Knowledge Bases (KBs) frequently suffer from biases caused by limited data and face challenges in indexing relevant information. To overcome these limitations, this paper introduces an explainable multi-agent collaboration framework that taps into the knowledge embedded in Large Language Models (LLMs) trained on extensive corpora. Inspired by human cognition, our framework uncovers latent information within the given question by employing three agents, i.e., Seeker, Responder, and Integrator, to perform a top-down reasoning process. The Seeker agent generates issues relevant to the original question. The Responder agent, based on a VLM, handles simple VQA tasks and provides candidate answers. The Integrator agent combines information from the Seeker agent and the Responder agent to produce the final VQA answer. Through this collaboration mechanism, our framework explicitly constructs a multi-view knowledge base for a specific image scene and reasons toward answers in a top-down manner. We extensively evaluate our method on diverse VQA datasets and VLMs, and comprehensive experimental results demonstrate its broad applicability and interpretability.
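
The three-agent flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function signatures, the `Trace` record, and the toy agents are all hypothetical, and the sketch assumes that the Responder also answers the Seeker's related questions to populate the scene knowledge base, which the abstract does not state explicitly. In practice the Seeker and Integrator would be backed by an LLM and the Responder by a VLM; here they are plain functions so the control flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # a (question, answer) pair


@dataclass
class Trace:
    """Everything the pipeline produced, kept so the reasoning can be inspected."""
    question: str
    candidate: str        # Responder's direct answer to the original question
    knowledge: List[QA]   # related questions with their answers
    answer: str           # Integrator's final answer


def run_pipeline(
    image: str,
    question: str,
    seeker: Callable[[str], List[str]],
    responder: Callable[[str, str], str],
    integrator: Callable[[str, str, List[QA]], str],
) -> Trace:
    # Seeker: propose issues relevant to the original question.
    related = seeker(question)
    # Responder: give a candidate answer to the original question.
    candidate = responder(image, question)
    # Assumption: the Responder also answers each related question,
    # yielding a small knowledge base for this image scene.
    knowledge = [(q, responder(image, q)) for q in related]
    # Integrator: combine the candidate and the knowledge into a final answer.
    answer = integrator(question, candidate, knowledge)
    return Trace(question, candidate, knowledge, answer)


# Toy stand-ins (hypothetical) for the LLM- and VLM-backed agents.
def toy_seeker(question: str) -> List[str]:
    return ["What is the weather?", "What is the person holding?"]


def toy_responder(image: str, question: str) -> str:
    scene = {
        "What is the weather?": "rainy",
        "What is the person holding?": "umbrella",
    }
    return scene.get(question, "unsure")


def toy_integrator(question: str, candidate: str, knowledge: List[QA]) -> str:
    facts = dict(knowledge)
    if candidate == "unsure" and facts.get("What is the weather?") == "rainy":
        return "to stay dry"
    return candidate


trace = run_pipeline(
    "img_001",
    "Why is the person holding an umbrella?",
    toy_seeker,
    toy_responder,
    toy_integrator,
)
```

Returning the full `Trace` rather than only the final answer reflects the framework's emphasis on explainability: the related questions and their answers remain available for inspection alongside the result.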

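This paper proposes an explainable multi-agent collaboration framework that draws on the knowledge embedded in large language models trained on extensive corpora. Inspired by human cognition, it uses three agents, namely Seeker, Responder, and Integrator, to carry out a top-down reasoning process, thereby explicitly constructing a multi-view knowledge base for a specific image scene and reasoning toward answers in a top-down manner. We extensively evaluate our method on diverse visual question answering datasets and vision language models, and comprehensive experimental results demonstrate its broad applicability and interpretability.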

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering