Well-formed, context-aware image captions and tags in enterprise content such
as marketing material are critical to ensuring brand presence and content
recall. Creating and updating them manually is non-trivial given the scale and
tedium of the task. We propose a new unified
Vision-Language (VL) model based on the One For All (OFA) model, with a focus
on context-assisted image captioning where the caption is generated based on
both the image and its context. Our approach aims to overcome the
context-independent (image and text are treated independently) nature of the
existing approaches. We exploit context by pretraining our model with datasets
of three tasks: news image captioning where the news article is the context,
contextual visual entailment, and keyword extraction from the context. The
second pretraining task is a new VL task, and we construct and release two
datasets for the task with 1.1M and 2.2K data instances. Our system achieves
state-of-the-art results with an improvement of up to 8.34 CIDEr score on the
benchmark news image captioning datasets. To the best of our knowledge, ours is
the first effort at incorporating contextual information in pretraining the
models for the VL tasks.

This paper proposes a unified Vision-Language (VL) model for context-aware image captioning and uses pretraining to address the context-independence of existing approaches, achieving better results than prior work.

"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

We consider the task of image captioning using only the CLIP model and
additional text data at training time, and no additional captioned images. Our
approach relies on the fact that CLIP is trained to make visual and textual
embeddings similar. Therefore, we only need to learn how to translate CLIP
textual embeddings back into text, and we can learn how to do this by learning
a decoder for the frozen CLIP text encoder using only text. We argue that this
intuition is "almost correct" because of a gap between the embedding spaces,
and propose to rectify this via noise injection during training. We demonstrate
the effectiveness of our approach by showing SOTA zero-shot image captioning
across four benchmarks, including style transfer. Code, data, and models are
available on GitHub.

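This paper proposes an image captioning method that uses only the CLIP model and text data. Because only the mapping from text embeddings back to text has to be learned, it suffices to train a decoder on top of the frozen text embeddings; with noise injected during training, the method achieves SOTA zero-shot image captioning.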
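
The noise-injection idea in the abstract above can be illustrated with a minimal sketch. The abstract only says that noise is injected during training; the Gaussian form of the noise, the `noise_std` value, the re-normalization step, and the names `l2_normalize` and `inject_noise` are all assumptions made for illustration, and the random vector merely stands in for a frozen CLIP text embedding.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inject_noise(text_embedding, noise_std, rng):
    """Add zero-mean Gaussian noise to an embedding, then re-normalize.

    Assumption: Gaussian noise with a fixed standard deviation. The abstract
    does not specify the noise distribution or its scale.
    """
    noisy = [x + rng.gauss(0.0, noise_std) for x in text_embedding]
    return l2_normalize(noisy)


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)

# Stand-in for a frozen CLIP text embedding (hypothetical: random values,
# 512 dimensions). A real pipeline would obtain this from the text encoder.
text_embedding = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(512)])

# During training, the decoder would be conditioned on the noisy embedding
# and asked to reconstruct the original caption text.
noisy_embedding = inject_noise(text_embedding, noise_std=0.02, rng=rng)
similarity = cosine(text_embedding, noisy_embedding)
```

The intent of such a perturbation is that a decoder trained on slightly displaced text embeddings should tolerate the gap between text and image embeddings when, at inference time, it is given an image embedding instead.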