This short paper presents a preliminary analysis of three popular Visual
Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the
context of answering questions about driving scenarios. The performance
of these models is evaluated by measuring the similarity of their responses to
reference answers provided by computer vision experts. The models are
selected based on an analysis of how transformers are used in multimodal
architectures. The results indicate that models incorporating cross-modal
attention and late fusion show promise for generating better answers in
driving contexts. This initial analysis serves as
a launchpad for a forthcoming comprehensive comparative study involving nine
VQA models and sets the scene for further investigations into the effectiveness
of VQA model queries in self-driving scenarios. Supplementary material is
available at https://example.com/.

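This short study presents a preliminary analysis of three popular Visual Question Answering (VQA) models, ViLBERT, ViLT, and LXMERT, in the context of answering questions related to driving scenarios. The performance of these models is evaluated by comparing the similarity between reference answers provided by computer vision experts and the answers the models produce. Suitable models are selected by analyzing how transformers are used in multimodal architectures, and the results indicate that models combining cross-modal attention and late fusion have strong potential for generating improved answers in driving scenarios. This preliminary analysis lays the groundwork for an upcoming comprehensive comparative study involving nine VQA models, and provides context for further research into the effectiveness of VQA models in self-driving scenarios. Supplementary material is available at https://example.com/.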