The availability of large-scale image captioning and visual question
answering datasets has contributed significantly to recent successes in
vision-and-language pre-training. However, these datasets are often collected
with overly restrictive requirements inherited from their original target tasks
(e.g., image caption generation), which limit the resulting dataset scale and
diversity. We take a step further in pushing the limits of vision-and-language
pre-training data by relaxing the data collection pipeline used in Conceptual
Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M
(CC12M), a dataset with 12 million image-text pairs specifically meant to be
used for vision-and-language pre-training. We perform an analysis of this
dataset and benchmark its effectiveness against CC3M on multiple downstream
tasks with an emphasis on long-tail visual recognition. Our results clearly
illustrate the benefit of scaling up pre-training data for vision-and-language
tasks, as indicated by the new state-of-the-art results on both the nocaps and
Conceptual Captions benchmarks.

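By relaxing the data collection pipeline of Conceptual Captions 3M (CC3M) [Sharma et al. 2018], we introduce the Conceptual 12M (CC12M) dataset and benchmark its effectiveness on multiple downstream tasks with an emphasis on long-tail visual recognition. The results show that scaling up pre-training data benefits vision-and-language tasks.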

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Image captioning models generally lack the capability to take into account
user interest, and usually default to global descriptions that try to balance
readability, informativeness, and information overload. On the other hand, VQA
models generally lack the ability to provide long descriptive answers, while
expecting the textual question to be quite precise. We present a method to
control the concepts that an image caption should focus on, using an additional
input called the guiding text that refers to either groundable or ungroundable
concepts in the image. Our model consists of a Transformer-based multimodal
encoder that uses the guiding text together with global and object-level image
features to derive early-fusion representations used to generate the guided
caption. While models trained on Visual Genome data have an in-domain advantage
of fitting well when guided with automatic object labels, we find that guided
captioning models trained on Conceptual Captions generalize better on
out-of-domain images and guiding texts. Our human-evaluation results indicate
that attempting in-the-wild guided image captioning requires access to large,
unrestricted-domain training datasets, and that increased style diversity (even
without increasing the number of unique tokens) is a key factor for improved
performance.

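This paper presents a method that uses a guiding text to control what an image caption focuses on. A Transformer-based multimodal encoder combines the guiding text with global and object-level image features into early-fusion representations, which are used to generate the guided caption. Guided captioning models trained on Conceptual Captions generalize better to out-of-domain images and guiding texts, and increased style diversity is a key factor for improved performance.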
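
The early-fusion idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 4-dimensional toy vectors, and the single attention head with identity projections are all assumptions made for illustration. It only shows the core step, in which guiding-text tokens, one global image feature, and object-level features are concatenated into a single sequence before attention, so that every position can attend to every modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion_sequence(guiding_text, global_feature, object_features):
    """Concatenate all modalities into one sequence before any attention.

    Hypothetical helper: the input order (text, global, objects) is an
    assumption, not something the abstract specifies.
    """
    return list(guiding_text) + [global_feature] + list(object_features)


def self_attention(sequence):
    """Single-head scaled dot-product self-attention.

    Uses identity query/key/value projections to keep the sketch short;
    a real Transformer layer learns these projections.
    """
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    fused = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in sequence]
        weights = softmax(scores)
        fused.append([sum(w * value[i] for w, value in zip(weights, sequence))
                      for i in range(dim)])
    return fused


# Toy inputs (made-up values): two guiding-text token embeddings,
# one global image feature, three object-level features.
guiding_text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
global_feature = [0.5, 0.5, 0.5, 0.5]
object_features = [[0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0, 0.0]]

sequence = early_fusion_sequence(guiding_text, global_feature, object_features)
fused = self_attention(sequence)
```

Each row of `fused` is a weighted mix of text and image vectors, which is what distinguishes early fusion from encoding each modality separately and combining the results afterwards.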