BriefGPT.xyz
Nov, 2024
Multimodal Commonsense Knowledge Distillation for Visual Question Answering
Shuo Yang, Siwen Luo, Soyeon Caren Han
TL;DR
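This study addresses the challenges that existing multimodal large language models and vision-language pretrained models face on visual question answering problems that require external commonsense knowledge. We propose a new graph-based multimodal commonsense knowledge distillation framework that builds a unified relational graph through a graph convolutional network, effectively integrating commonsense knowledge, visual objects, and questions, and that achieves competitive performance on the ScienceQA dataset.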
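To make the idea of a graph convolution over a unified relational graph concrete, here is a minimal sketch of one standard GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. This summary does not describe the paper's actual graph construction, features, or network configuration, so the three-node graph, the node roles, the feature values, and the function name `gcn_layer` are all illustrative assumptions, not the authors' implementation.

```python
import math


def gcn_layer(adj, feats, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 * H * W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric normalization by node degree (self-loop included).
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features.
    in_dim = len(feats[0])
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n))
            for d in range(in_dim)] for i in range(n)]
    # Apply the linear map, then ReLU.
    out_dim = len(weight[0])
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]


# Hypothetical toy graph (assumed roles, not taken from the paper):
# node 0 = question, node 1 = visual object, node 2 = commonsense fact.
adj = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d node features
weight = [[1.0, 0.0], [0.0, 1.0]]             # identity, for readability

hidden = gcn_layer(adj, feats, weight)
```

After one step, the question node's representation mixes in the features of the visual-object and commonsense nodes it is connected to, which is the basic mechanism by which a relational graph can combine the three kinds of information.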
Abstract
Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance in general Visual Question Answering (VQA). However, these models struggle wi