Knowledge-based visual question answering requires the ability to associate
external knowledge for open-ended cross-modal scene understanding. One
limitation of existing solutions is that they capture relevant knowledge from
text-only knowledge bases, which merely contain facts expressed by first-order
predicates or language descriptions while lacking complex but indispensable
multimodal knowledge for visual understanding. How to construct vision-relevant
and explainable multimodal knowledge for the VQA scenario has been less
studied. In this paper, we propose MuKEA, which represents multimodal knowledge as
an explicit triplet that correlates visual objects and fact answers through implicit
relations. To bridge the heterogeneous gap, we propose three objective losses
to learn the triplet representations from complementary views: embedding
structure, topological relation and semantic space. By adopting a pre-training
and fine-tuning learning strategy, both basic and domain-specific multimodal
knowledge are progressively accumulated for answer prediction. MuKEA outperforms
the state of the art by 3.35% and 6.08% on two challenging
knowledge-requiring datasets, OK-VQA and KRVQA, respectively. Experimental results
demonstrate that the multimodal knowledge complements existing knowledge
bases and that our end-to-end framework has advantages over existing pipeline
methods. The code is available at this https URL

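This study proposes MuKEA, a multimodal knowledge representation method for visual question answering. It uses explicit triplets to associate visual objects with fact answers, thereby constructing vision-relevant and explainable multimodal knowledge, and it effectively improves performance on knowledge-requiring datasets.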
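
The triplet idea above (a visual object as head, an implicit relation, and a fact answer as tail) can be illustrated with a small sketch. The abstract does not give the loss formulas, so this assumes a TransE-style translation objective (head + relation should land near tail) with a margin ranking loss; that choice, the function names, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def translation_score(head, relation, tail):
    """Euclidean distance between (head + relation) and tail.

    Assumption: a TransE-style score, where lower means a better fit.
    """
    translated = [h + r for h, r in zip(head, relation)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(translated, tail)))


def margin_ranking_loss(head, relation, tail, negative_tail, margin=1.0):
    """Hinge loss pushing the true tail closer than a negative tail by `margin`."""
    positive = translation_score(head, relation, tail)
    negative = translation_score(head, relation, negative_tail)
    return max(0.0, margin + positive - negative)


def predict_answer(head, relation, candidate_tails):
    """Return the candidate answer whose embedding is nearest to head + relation."""
    return min(
        candidate_tails,
        key=lambda name: translation_score(head, relation, candidate_tails[name]),
    )


# Toy, hand-picked embeddings (hypothetical; real ones would be learned).
head = [1.0, 0.0]        # embedding of a question-focused visual object
relation = [0.0, 1.0]    # embedding of the implicit relation
answers = {
    "umbrella": [1.0, 1.0],
    "surfboard": [-1.0, 0.0],
}

best = predict_answer(head, relation, answers)
loss = margin_ranking_loss(head, relation, answers["umbrella"], answers["surfboard"])
print(best, loss)
```

In this sketch, answer prediction reduces to a nearest-neighbour lookup over candidate tail embeddings, which is one plausible way an end-to-end triplet model could avoid a separate knowledge-retrieval pipeline.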