We study knowledge-based visual question answering (KB-VQA), in which a model
must ground a given question in the visual modality to find the answer.
Although many recent works use question-dependent captioners to verbalize the
given image and Large Language Models to solve the VQA problem, reported
results show that they do not perform well on multi-hop questions. Our study
shows that replacing a complex question with
several simpler questions helps to extract more relevant information from the
image and provide a stronger comprehension of it. Moreover, we analyze the
decomposed questions to determine the modality of the information required to
answer each one, using a captioner for the visual questions and LLMs as a
general knowledge source for the non-visual, knowledge-based questions. Our
results demonstrate the positive impact of using simple questions before
retrieving visual or non-visual information. We report results and analysis
on three well-known VQA datasets (OKVQA, A-OKVQA, and KRVQA) and achieve up to
a 2% improvement in accuracy.
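
The routing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SubQuestion` type, the `is_visual` flag, `gather_context`, and the two toy answer functions are all hypothetical names introduced here, and the sub-questions are assumed to be already decomposed and labeled by modality.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubQuestion:
    text: str
    is_visual: bool  # assumed output of a (hypothetical) modality classifier


def gather_context(
    sub_questions: List[SubQuestion],
    ask_captioner: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> List[str]:
    """Route each simple question to the source matching its modality."""
    context = []
    for sq in sub_questions:
        # Visual questions go to the captioner, the rest to the LLM.
        source = ask_captioner if sq.is_visual else ask_llm
        context.append(f"Q: {sq.text} A: {source(sq.text)}")
    return context


# Toy stand-ins so the sketch runs without any model; a real system would
# call a question-dependent captioner and an LLM here.
def toy_captioner(question: str) -> str:
    return "a red double-decker bus"


def toy_llm(question: str) -> str:
    return "London"


sub_questions = [
    SubQuestion("What vehicle is in the image?", is_visual=True),
    SubQuestion("Which city is known for red double-decker buses?", is_visual=False),
]
context = gather_context(sub_questions, toy_captioner, toy_llm)
```

The collected `context` lines would then be passed, together with the original complex question, to whichever model produces the final answer.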

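We study knowledge-based visual question answering. Replacing a complex question with several simpler questions extracts more relevant information from the image and strengthens image comprehension, yielding up to a 2% accuracy improvement on three well-known VQA datasets.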

KB-VQA that decouples knowledge-based and visual reasoning through question decomposition

Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Generative training has been demonstrated to be powerful for building
visual-language models. However, on zero-shot discriminative benchmarks, there
is still a performance gap between models trained with generative and
discriminative objectives. In this paper, we aim to narrow this gap by
improving the efficacy of generative training on classification tasks, without
any fine-tuning or additional modules.
Specifically, we focus on narrowing the gap between the generative captioner
and the CLIP classifier. We begin by analysing the predictions made by the
captioner and the classifier and observe that caption generation inherits the
distributional bias of a language model trained on text alone, making it less
grounded in the visual signal. To tackle this problem, we
redesign the scoring objective for the captioner to alleviate the
distributional bias and focus on measuring the gain of information brought by
the visual inputs. We further design a generative training objective to match
the evaluation objective. We call the model trained and evaluated with these
new procedures the Information Gain (IG) captioner. We pretrain the models on
the public LAION-5B dataset and perform a series of discriminative evaluations.
For zero-shot classification on ImageNet, the IG captioner achieves a $> 18\%$
improvement over the standard captioner, reaching performance comparable to
the CLIP classifier. The IG captioner also demonstrates strong performance on
zero-shot image-text retrieval on MSCOCO and Flickr30K. We hope this
paper inspires further research towards unifying generative and discriminative
training procedures for visual-language models.
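
The idea of scoring by the information gained from the visual input can be illustrated with a small sketch. It assumes the gain is the image-conditioned log-likelihood of a caption minus its text-only log-likelihood. The abstract does not state the exact formula, so this form, the function names, and the toy numbers below are all assumptions made for illustration.

```python
from typing import Dict


def information_gain_scores(
    cond_logp: Dict[str, float], prior_logp: Dict[str, float]
) -> Dict[str, float]:
    """Assumed form of the score: log p(text | image) - log p(text)."""
    return {label: cond_logp[label] - prior_logp[label] for label in cond_logp}


def argmax_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


# Made-up log-likelihoods for two candidate class prompts.
# log p(text | image): what a captioner would assign given the image.
cond_logp = {"a photo of a dog": -2.0, "a photo of an axolotl": -2.5}
# log p(text): text-only likelihood, which favors the more common phrase.
prior_logp = {"a photo of a dog": -3.0, "a photo of an axolotl": -8.0}

# Scoring by the conditional likelihood alone inherits the text prior.
standard_prediction = argmax_label(cond_logp)

# Subtracting the text-only term rewards what the image actually adds.
ig_scores = information_gain_scores(cond_logp, prior_logp)
ig_prediction = argmax_label(ig_scores)
```

In this toy case the conditional likelihood alone prefers the frequent phrase, while the difference prefers the caption whose likelihood rises most once the image is seen.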

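By improving the evaluation objective used in generative training, this work aims to narrow the gap between generative captioners and the CLIP classifier. It reaches comparable performance on zero-shot image classification and image-text retrieval, and it hopes to encourage further research on unifying generative and discriminative training procedures.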