A paraphrase is a restatement of the meaning of a text in other words.
Paraphrases have been studied to enhance the performance of many natural
language processing tasks. In this paper, we propose a novel task, iParaphrasing,
to extract visually grounded paraphrases (VGPs), which are different phrasal
expressions describing the same visual concept in an image. These extracted
VGPs have the potential to improve language and image multimodal tasks such as
visual question answering and image captioning. Modeling the similarity
between VGPs is the key to iParaphrasing. We apply various existing methods,
propose a novel neural network-based method with image attention, and
report the results of the first attempt toward iParaphrasing.

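This paper proposes a new task, iParaphrasing, which extracts visually grounded paraphrases (VGPs) to improve language and image multimodal tasks. It models the similarity between VGPs with various existing methods and with a neural network-based method with image attention, and reports the results.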

iParaphrasing: Extracting Visually Grounded Paraphrases via an Image

In state-of-the-art Neural Machine Translation (NMT), an attention mechanism
is used during decoding to enhance the translation. At every step, the decoder
uses this mechanism to focus on different parts of the source sentence to
gather the most useful information before outputting its target word. Recently,
the effectiveness of the attention mechanism has also been explored for
multimodal tasks, where it becomes possible to focus both on sentence parts and
image regions that they describe. In this paper, we compare several attention
mechanisms on the multimodal translation task (English, image to German) and
evaluate the ability of the model to make use of images to improve translation.
Although we surpass state-of-the-art scores on the Multi30k data set, we
identify and report several kinds of misbehavior of the machine while translating.
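
The attention step described above can be illustrated with a minimal sketch. It is not the model evaluated in the paper: it assumes a simple dot-product scoring function (NMT systems often use an additive scorer instead), and the names `attend`, `decoder_state`, `source_annotations`, and `image_annotations` as well as all numbers are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, annotations):
    """Dot-product attention (an assumed, simplified scoring function).

    query:       the current decoder state, a list of floats
    annotations: one vector per source word or per image region
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * ann[k] for w, ann in zip(weights, annotations))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical toy values: three source words and two image regions.
decoder_state = [1.0, 0.0]
source_annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
image_annotations = [[0.2, 0.9], [0.8, 0.1]]

# At each decoding step the same state queries both modalities.
text_weights, text_context = attend(decoder_state, source_annotations)
image_weights, image_context = attend(decoder_state, image_annotations)
```

In this toy setup the decoder state is most similar to the first source word and to the second image region, so those receive the largest weights. A multimodal decoder would then combine `text_context` and `image_context` before predicting the next target word.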

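This paper compares several attention mechanisms on the multimodal translation task (English and image to German) and evaluates the model's ability to use images to improve translation. Although it surpasses state-of-the-art scores on the Multi30k data set, it also identifies and reports several kinds of misbehavior of the machine while translating.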

An empirical study on the effectiveness of images in Multimodal Neural Machine Translation

In state-of-the-art Neural Machine Translation, an attention mechanism is
used during decoding to enhance the translation. At every step, the decoder
uses this mechanism to focus on different parts of the source sentence to
gather the most useful information before outputting its target word. Recently,
the effectiveness of the attention mechanism has also been explored for
multimodal tasks, where it becomes possible to focus both on sentence parts and
image regions. Approaches to pool two modalities usually include element-wise
product, sum or concatenation. In this paper, we evaluate the more advanced
Multimodal Compact Bilinear pooling method, which takes the outer product of
two vectors to combine the attention features for the two modalities. This has
been previously investigated for visual question answering. We try out this
approach for multimodal image caption translation and show improvements
compared to basic combination methods.
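
The pooling options named above can be compared in a short sketch. It is an illustration under stated assumptions, not the paper's implementation: the compact variant is written here as the circular convolution of two count sketches, which reproduces a count sketch of the full outer product in only `d` dimensions. Practical implementations usually compute the convolution with FFTs, and the vectors, hash tables, and helper names below are all hypothetical.

```python
import random


def count_sketch(vec, hashes, signs, d):
    """Project vec into d dimensions: entry i goes to bucket hashes[i] with sign signs[i]."""
    out = [0.0] * d
    for i, value in enumerate(vec):
        out[hashes[i]] += signs[i] * value
    return out


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution; an FFT would be used at scale."""
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out


def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Approximate the outer product of x and y in d dimensions."""
    return circular_convolution(count_sketch(x, hx, sx, d),
                                count_sketch(y, hy, sy, d))


# Hypothetical attention features for the text and image modalities.
x = [0.5, -1.0, 2.0, 0.0]
y = [1.0, 0.5, -0.5, 2.0]

# Basic combination methods.
product = [a * b for a, b in zip(x, y)]   # element-wise product
summed = [a + b for a, b in zip(x, y)]    # element-wise sum
concat = x + y                            # concatenation

# Full bilinear pooling: the flattened outer product has len(x) * len(y) entries.
outer = [a * b for a in x for b in y]

# Compact bilinear pooling with fixed random hashes and signs (seeded here).
rng = random.Random(0)
d = 8
hx = [rng.randrange(d) for _ in x]
sx = [rng.choice((-1, 1)) for _ in x]
hy = [rng.randrange(d) for _ in y]
sy = [rng.choice((-1, 1)) for _ in y]
mcb = compact_bilinear(x, y, hx, sx, hy, sy, d)
```

The design trade-off is size versus expressiveness: the element-wise methods only pair matching dimensions, the full outer product pairs every dimension of one modality with every dimension of the other but grows quadratically, and the compact version keeps those pairwise interactions in a fixed, much smaller output.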

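This paper examines the Multimodal Compact Bilinear pooling method for multimodal translation: combining the two attention features through an outer product improves image caption translation compared with basic combination methods.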