In this work, we investigate a more realistic unsupervised multimodal machine
translation (UMMT) setup, inference-time image-free UMMT, where the model is
trained on source-text-image pairs and tested with only source-text inputs.
First, we represent the input images and texts as visual and language
scene graphs (SGs), whose fine-grained vision-language features support a
holistic understanding of the semantics. To enable pure-text input during
inference, we devise a visual scene hallucination mechanism that dynamically
generates a pseudo visual SG from the given textual SG. Several SG-pivoting-based
learning objectives are introduced for unsupervised translation training. On
the Multi30K benchmark, our SG-based method outperforms the best-performing
baseline by a significant margin in BLEU on this task and setup,
helping yield translations with better completeness, relevance and fluency
without relying on paired images. Further in-depth analyses reveal how our
model improves in this task setting.

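This study explores a more realistic unsupervised multimodal machine translation (UMMT) setting, inference-time image-free UMMT, in which the model is trained on source-text-image pairs and tested with only source-text inputs. To enable pure-text input at inference time, we design a visual scene hallucination mechanism that dynamically generates pseudo visual scene graphs. For unsupervised translation training, we propose several learning objectives based on scene graph pivoting. On the Multi30K benchmark, our SG-based method significantly outperforms the best baseline on this task and setting, helping produce more complete, relevant and fluent translations without relying on paired images. Further in-depth analysis reveals how our model makes progress in this task setting.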

Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

The attention mechanism is an important part of neural machine
translation (NMT), where it has been reported to produce richer source representations
than sequence-to-sequence models with fixed-length encodings. Recently, the
effectiveness of attention has also been explored in the context of image
captioning. In this work, we assess the feasibility of a multimodal attention
mechanism that simultaneously focuses on an image and its natural language
description for generating a description in another language. We train several
variants of our proposed attention mechanism on the Multi30k multilingual image
captioning dataset. We show that a dedicated attention for each modality
achieves gains of up to 1.6 points in BLEU and METEOR over a textual NMT
baseline.

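This paper applies a multimodal attention mechanism to image captioning. By attending simultaneously to the natural language description and the image, it generates a description in another language, and it achieves better results on the Multi30k dataset.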
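
The idea of a dedicated attention for each modality can be illustrated with a minimal sketch. The abstract does not specify the scoring function or how the two contexts are combined, so the dot-product scoring and the concatenation used below are assumptions made for illustration only, and the names `attend` and `multimodal_context` are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (assumed scoring function).

    Returns the weighted sum of `states` and the attention weights.
    """
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(dim)
    ]
    return context, weights


def multimodal_context(query, text_states, image_states):
    """Run one separate attention per modality, then fuse the contexts.

    Fusion by concatenation is an assumption; other choices (for example
    summation) would fit the same description.
    """
    text_ctx, text_weights = attend(query, text_states)
    image_ctx, image_weights = attend(query, image_states)
    return text_ctx + image_ctx, text_weights, image_weights


if __name__ == "__main__":
    decoder_state = [1.0, 0.0]                 # toy decoder query
    text_states = [[1.0, 0.0], [0.0, 1.0]]     # toy encoder states (2 words)
    image_states = [[0.5, 0.5]] * 3            # toy image region features
    fused, text_w, image_w = multimodal_context(
        decoder_state, text_states, image_states
    )
    print(len(fused), text_w, image_w)
```

Each modality gets its own normalized weight distribution, so the decoder can attend sharply to one source word while spreading its attention evenly over image regions, or the reverse.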