We introduce a novel multimodal machine translation model that utilizes
parallel visual and textual information. Our model jointly optimizes the
learning of a shared visual-language embedding and a translator. The model
leverages a visual attention grounding mechanism that links the visual
semantics with the corresponding textual semantics. Our approach achieves
results competitive with the state of the art on the Multi30K and Ambiguous
COCO datasets. We also collect a new multilingual multimodal product description
dataset to simulate a real-world international online shopping scenario. On
this dataset, our visual attention grounding model outperforms other methods by
a large margin.
