Due to the limitations of the model structure and pre-training objectives, existing vision-and-language generation models cannot exploit paired images and text through bi-directional generation. In this paper, we propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks. To bridge the gap between image understanding and generation, we further design a novel commitment loss. We compare pre-training objectives on image captioning and text-to-image generation datasets. Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss. We also obtain higher scores than previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges confirm that our model generates realistic and relevant images as well as faithful and informative captions.

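This paper proposes DU-VLG, a framework that treats vision-and-language generation as sequence generation problems and exploits paired images and text through bi-directional generation. The model is pre-trained with dual tasks, namely multi-modal denoising autoencoder tasks and modality translation tasks, and a novel commitment loss is designed to improve the quality of image generation. Results show that DU-VLG performs better on image captioning and text-to-image generation datasets than variants trained with uni-directional generation objectives or without the commitment loss, and that it obtains higher scores than previous state-of-the-art systems on three vision-and-language generation tasks. In addition, human judges further confirm that our model generates realistic and relevant images as well as faithful and informative captions.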

DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training