We propose a novel algorithm, named Open-Edit, which is the first attempt at
open-domain image manipulation with open-vocabulary instructions. This is a
challenging task given the large variation of image domains and the lack
of training supervision. Our approach takes advantage of the unified
visual-semantic embedding space pretrained on a general image-caption dataset,
and manipulates the embedded visual features by applying text-guided vector
arithmetic on the image feature maps. A structure-preserving image decoder then
generates the manipulated images from the manipulated feature maps. We further
propose an on-the-fly sample-specific optimization approach with
cycle-consistency constraints to regularize the manipulated images and force
them to preserve details of the source images. Our approach shows promising
results in manipulating open-vocabulary color, texture, and high-level
attributes across a variety of open-domain images.
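
The text-guided vector arithmetic described above can be illustrated with a
minimal sketch. The abstract does not state the exact update rule, so the form
used here (remove the component along a source-text embedding, add the same
amount along a target-text embedding) is an assumption, as are the names
edit_feature_map and alpha and the toy 3-dimensional embeddings. A real system
would obtain the features and text embeddings from the pretrained
visual-semantic encoders and pass the result to the image decoder.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def edit_feature_map(feature_map, src_text, tgt_text, alpha=1.0):
    """Shift each spatial feature away from src_text and toward tgt_text.

    feature_map: H x W grid of feature vectors (nested lists).
    src_text, tgt_text: text embeddings assumed to live in the same space.
    alpha: hypothetical edit strength.
    """
    t_src = normalize(src_text)
    t_tgt = normalize(tgt_text)
    edited = []
    for row in feature_map:
        new_row = []
        for v in row:
            # How strongly this location expresses the source attribute.
            w = dot(v, t_src)
            new_row.append([
                x - alpha * w * s + alpha * w * t
                for x, s, t in zip(v, t_src, t_tgt)
            ])
        edited.append(new_row)
    return edited


# Toy 1 x 2 feature map: only the first location carries the source attribute.
fmap = [[[1.0, 0.0, 0.5], [0.0, 0.0, 0.5]]]
edited = edit_feature_map(fmap, src_text=[1.0, 0.0, 0.0], tgt_text=[0.0, 1.0, 0.0])
```

In this toy case the first location has its source-attribute component swapped
for the target one, while the second location, which does not express the
source attribute, is left unchanged. That locality is one reason per-location
arithmetic on feature maps is a natural fit for attribute edits.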

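This paper proposes Open-Edit, a new method for open-domain image manipulation. It manipulates images through text-guided image translation and generation: the image feature maps are adjusted, and a structure-preserving image decoder generates the desired manipulated image. The method achieves good results in manipulating open-vocabulary colors, textures, and high-level attributes.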

Open-Edit: Open-Domain Image Editing Using Open-Vocabulary Instructions

Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions

Unsupervised machine translation (MT) has recently achieved impressive
results using only monolingual corpora. However, it is still challenging to
associate source-target sentences in the latent space. Since people who speak
different languages share biologically similar visual systems, the potential of
achieving better alignment through visual content is promising yet
under-explored in unsupervised multimodal MT (MMT). In this paper, we
investigate how to utilize visual content to disambiguate and to promote
latent space alignment in unsupervised MMT. Our model employs multimodal
back-translation and features pseudo visual pivoting in which we learn a shared
multilingual visual-semantic embedding space and incorporate visually-pivoted
captioning as additional weak supervision. The experimental results on the
widely used Multi30K dataset show that the proposed model significantly
improves over the state-of-the-art methods and generalizes well when images
are not available at test time.
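
One common way to learn a shared visual-semantic embedding space of the kind
mentioned above is a bidirectional hinge-based ranking loss, applied between
images and sentences in each language so that the image acts as the pivot. The
abstract does not specify the loss, so the sketch below is an assumption, not
the paper's objective. The names ranking_loss, pivoted_loss, and margin are
hypothetical, and the embeddings are toy 2-dimensional vectors.

```python
def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def ranking_loss(images, sentences, margin=0.2):
    """Bidirectional hinge loss; images[i] is paired with sentences[i]."""
    loss = 0.0
    n = len(images)
    for i in range(n):
        pos = cosine(images[i], sentences[i])
        for j in range(n):
            if j == i:
                continue
            # Image i should be closer to its own sentence than to sentence j.
            loss += max(0.0, margin - pos + cosine(images[i], sentences[j]))
            # Sentence i should be closer to its own image than to image j.
            loss += max(0.0, margin - pos + cosine(images[j], sentences[i]))
    return loss


def pivoted_loss(images, src_sents, tgt_sents, margin=0.2):
    """Align both languages to the same images, which serve as the pivot."""
    return (ranking_loss(images, src_sents, margin)
            + ranking_loss(images, tgt_sents, margin))


images = [[1.0, 0.0], [0.0, 1.0]]
english = [[1.0, 0.0], [0.0, 1.0]]
german_aligned = [[1.0, 0.0], [0.0, 1.0]]
german_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = pivoted_loss(images, english, german_aligned)
loss_swapped = pivoted_loss(images, english, german_swapped)
```

Because both languages are pulled toward the same image embeddings, sentences
that describe the same image end up near each other even though no parallel
sentence pairs appear in the loss.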

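This work investigates how to use visual content for disambiguation and for improving latent space alignment in unsupervised multimodal machine translation. The model employs multimodal back-translation with pseudo visual pivoting, learning a multilingual visual-semantic embedding space and using visually pivoted captioning as additional weak supervision. Experiments show that the model significantly outperforms state-of-the-art methods and generalizes well at test time.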

Unsupervised Multimodal Neural Machine Translation Based on Pseudo Visual Pivoting

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

With the aim of promoting and understanding the multilingual version of image
search, we leverage visual object detection and propose a model with diverse
multi-head attention to learn grounded multilingual multimodal representations.
Specifically, our model attends to different types of textual semantics in two
languages and visual objects for fine-grained alignments between sentences and
images. We introduce a new objective function which explicitly encourages
attention diversity to learn an improved visual-semantic embedding space. We
evaluate our model in the German-Image and English-Image matching tasks on the
Multi30K dataset, and in the Semantic Textual Similarity task with the English
descriptions of visual content. Results show that our model yields a
significant performance gain over other methods in all three tasks.
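
An objective that encourages attention diversity can take several forms, and
the abstract does not say which one is used. The sketch below shows one
plausible variant, assumed here for illustration only: a penalty on the
pairwise cosine similarity between the embeddings produced by different
attention heads. The names diversity_penalty and margin are hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def diversity_penalty(head_outputs, margin=0.0):
    """Mean hinge penalty on pairwise similarity between head outputs.

    head_outputs: one embedding per attention head for the same input.
    margin: similarity at or below this value is not penalized.
    """
    total, pairs = 0.0, 0
    for i in range(len(head_outputs)):
        for j in range(i + 1, len(head_outputs)):
            total += max(0.0, cosine(head_outputs[i], head_outputs[j]) - margin)
            pairs += 1
    return total / pairs if pairs else 0.0


# Heads that attend to different aspects produce dissimilar embeddings.
diverse_heads = [[1.0, 0.0], [0.0, 1.0]]
# Heads that collapse onto the same aspect produce identical embeddings.
redundant_heads = [[1.0, 0.0], [1.0, 0.0]]

penalty_diverse = diversity_penalty(diverse_heads)
penalty_redundant = diversity_penalty(redundant_heads)
```

Added to a matching loss, a term like this would discourage heads from
collapsing onto the same words or objects, which is the intuition behind
asking for diverse attention.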

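This paper proposes a model for multilingual multimodal representations based on visual object detection and diverse textual semantics. It uses multi-head attention to align the textual semantics of two languages with visual objects at a fine-grained level, thereby learning a better visual-semantic embedding space, and it shows significant performance gains over other methods on multiple tasks.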