We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce VerSe, a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, or multimodal embeddings. We find that textual embeddings perform well when gold-standard textual annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. We also verify our findings by using the textual and multimodal embeddings as features in a supervised setting and analyse the performance of visual sense disambiguation task. VerSe is made publicly available and can be downloaded at: https://github.com/spandanagella/verse.

本文介绍了一项新任务：为动词进行视觉意义消歧，以此作为多模态任务如图像检索和图像描述的基础，并提出了基于Lesk算法的无监督算法来执行视觉意义消歧，说明了在有和无标注图像情况下，文本嵌入和多模态嵌入的性能。本文最终提供了VerSe数据集，并提供了下载链接。

基于多模态嵌入的动词无监督视觉语义消歧