In this paper, we focus on training and evaluating effective word embeddings
with both text and visual information. More specifically, we introduce a
large-scale dataset with 300 million sentences describing over 40 million
images crawled and downloaded from publicly available Pins (i.e. an image with
sentence descriptions uploaded by users) on Pinterest. This dataset is more
than 200 times larger than MS COCO, the standard large-scale image dataset with
sentence descriptions. In addition, we construct an evaluation dataset to
directly assess the effectiveness of word embeddings in terms of finding
semantically similar or related words and phrases. The word/phrase pairs in
this evaluation dataset are collected from the click data with millions of
users in an image search system, thus contain rich semantic relationships.
Based on these datasets, we propose and compare several Recurrent Neural
Networks (RNNs) based multimodal (text and image) models. Experiments show that
our model benefits from incorporating the visual information into the word
embeddings, and a weight sharing strategy is crucial for learning such
multimodal embeddings. The project page is:
this http URL

本研究旨在使用文本和视觉信息进行有效的单词嵌入训练和评估。研究人员提出了一个大规模数据集，其中包含 300 万语句，描述了来自 Pinterest 的超过 4000 万张图像。该研究还报道了一种基于 RNN 的多模态模型，通过在嵌入中整合视觉信息，该模型可以找到语义相似或相关的单词和短语。经验表明，共享策略对于学习这种多模态嵌入至关重要。