Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to
select, among a batch of candidate images, the one that best entails the target
word's meaning within a limited context. In this paper, we propose a
multi-modal retrieval framework that maximally leverages pretrained
Vision-Language models, as well as open knowledge bases and datasets. Our
system consists of the following key components: (1) Gloss matching: a
pretrained bi-encoder model is used to match contexts with proper senses of the
target words; (2) Prompting: matched glosses and other textual information,
such as synonyms, are incorporated using a prompting template; (3) Image
retrieval: semantically matching images are retrieved from large open datasets
using prompts as queries; (4) Modality fusion: contextual information from
different modalities is fused and used for prediction. Although our system
is not among the top-ranked entries at SemEval-2023 Task 1, it still
outperforms nearly half of the participating teams. More importantly, our
experiments yield useful insights for the fields of Word Sense Disambiguation
(WSD) and multi-modal learning. Our code is available on GitHub.
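
The four components above can be illustrated with a minimal, self-contained sketch. It is not the system described in this paper: a bag-of-words cosine similarity stands in for the pretrained bi-encoder and Vision-Language models, short captions stand in for images, and the sense inventory `GLOSSES`, the toy corpus `OPEN_DATASET`, the function names, and the fusion weight `alpha` are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a pretrained encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sense inventory (a real system would query a knowledge base).
GLOSSES = {
    "bank": [
        {"gloss": "a financial institution that accepts deposits of money",
         "synonyms": ["depository"]},
        {"gloss": "sloping land beside a river or lake",
         "synonyms": ["riverside"]},
    ]
}

# Hypothetical open dataset; captions stand in for retrievable images.
OPEN_DATASET = [
    "river bank with grass and water",
    "bank building with money vault",
]


def match_gloss(target, context):
    """(1) Gloss matching: pick the sense whose gloss best matches the context."""
    query = embed(context)
    return max(GLOSSES[target], key=lambda s: cosine(query, embed(s["gloss"])))


def build_prompt(target, context, sense):
    """(2) Prompting: combine context, synonyms, and gloss in one template."""
    synonyms = ", ".join(sense["synonyms"])
    return f"{context} ({target}, also {synonyms}): {sense['gloss']}"


def retrieve(prompt, k=1):
    """(3) Image retrieval: fetch the k best-matching items from the dataset."""
    query = embed(prompt)
    ranked = sorted(OPEN_DATASET, key=lambda c: cosine(query, embed(c)),
                    reverse=True)
    return ranked[:k]


def predict(target, context, candidates, alpha=0.5):
    """(4) Modality fusion: weighted sum of prompt and retrieved-item scores."""
    prompt = build_prompt(target, context, match_gloss(target, context))
    prompt_vec = embed(prompt)
    ref_vecs = [embed(r) for r in retrieve(prompt)]

    def score(caption):
        cand = embed(caption)
        text_score = cosine(prompt_vec, cand)
        ref_score = max(cosine(r, cand) for r in ref_vecs)
        return alpha * text_score + (1 - alpha) * ref_score

    return max(candidates, key=lambda image_id: score(candidates[image_id]))


candidates = {
    "img_a": "grass along the river water",
    "img_b": "money vault in a building",
}
best = predict("bank", "river bank", candidates)
```

In this toy setting, `best` is the identifier of the candidate whose caption scores highest under the fused similarity. Swapping `embed` for real text and image encoders would leave the control flow unchanged, which is the main point of the sketch.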

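We propose a multi-modal retrieval framework that makes full use of pretrained Vision-Language models, open knowledge bases, and datasets. It matches contexts with the senses of the target words, uses a prompting template to combine the matched glosses with other textual information for image retrieval, and fuses contextual information from different modalities for prediction, yielding insights for the fields of Word Sense Disambiguation and multi-modal learning.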