Neural captioners are typically trained to mimic human-generated references
without optimizing for any specific communication goal, leading to problems
such as the generation of vague captions. In this paper, we show that
fine-tuning an out-of-the-box neural captioner with a self-supervised
discriminative communication objective helps to recover a plain, visually
descriptive language that is more informative about image contents. Given a
target image, the system must learn to produce a description that enables an
out-of-the-box text-conditioned image retriever to identify that image among a
set of candidates. We experiment with the popular ClipCap captioner, also
replicating the main results with BLIP. In terms of similarity to ground-truth
human descriptions, the captions emerging from discriminative finetuning lag
slightly behind those generated by the non-finetuned model, when the latter is
trained and tested on the same caption dataset. However, when the model is used
without further tuning to generate captions for out-of-domain datasets, our
discriminatively finetuned captioner generates descriptions that resemble human
references more than those produced by the same captioner without finetuning.
We further show that, on the Conceptual Captions dataset, discriminatively
finetuned captions are more helpful than either vanilla ClipCap captions or
ground-truth captions for human annotators tasked with an image discrimination
task.
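
The discriminative objective described above can be sketched as follows. This
is only an illustration under stated assumptions, not the authors' code: the
abstract does not say how the retriever scores candidates, so the sketch
assumes cosine similarity between a caption embedding and each candidate image
embedding, a softmax over the candidates, and the log-probability of the target
image as the training signal. The names `cosine`, `discriminative_reward`,
`retrieved_index` and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def discriminative_reward(caption_emb, image_embs, target_idx, temperature=1.0):
    """Log-probability that the retriever picks the target image.

    Assumption: the retriever scores each candidate by cosine similarity to
    the caption embedding and a softmax turns the scores into probabilities.
    """
    scores = [cosine(caption_emb, img) / temperature for img in image_embs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target_idx] - log_z


def retrieved_index(caption_emb, image_embs):
    """Index of the candidate image the retriever would select."""
    scores = [cosine(caption_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

A caption whose embedding lies closer to the target than to the distractors
receives a reward nearer to zero, so maximizing this quantity pushes the
captioner toward descriptions that single out the target image. How the signal
is propagated to the captioner is not stated in the abstract.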

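This paper proposes fine-tuning a pretrained neural captioner with a self-supervised discriminative communication objective so that it generates more informative image descriptions, and validates the approach on the Conceptual Captions dataset.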

Cross-Domain Image Captioning with Discriminative Finetuning

Multimodal image-language transformers have achieved impressive results on a
variety of tasks that rely on fine-tuning (e.g., visual question answering and
image retrieval). We are interested in shedding light on the quality of their
pretrained representations -- in particular, whether these models can distinguish
different types of verbs or whether they rely solely on nouns in a given sentence.
To do so, we collect a dataset of image-sentence pairs (in English) covering
421 verbs that are either visual or commonly found in the pretraining data
(i.e., the Conceptual Captions dataset). We use this dataset to evaluate
pretrained image-language transformers and find that they fail more often in
situations that require verb understanding than in those involving other parts
of speech. We also investigate which categories of verbs are particularly
challenging.

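This paper studies the quality of the pretrained representations of multimodal image-language transformers. Using a dataset of image-sentence pairs, it shows that these models perform poorly in situations that require verb understanding, and it identifies the categories of verbs that are particularly challenging.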

Probing Image-Language Transformers for Verb Understanding

We introduce a new pre-trainable generic representation for visual-linguistic
tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the
simple yet powerful Transformer model as the backbone, and extends it to take
both visual and linguistic embedded features as input. Each element of
the input is either a word from the input sentence or a region-of-interest
(RoI) from the input image. VL-BERT is designed to fit most
visual-linguistic downstream tasks. To better exploit the generic
representation, we pre-train VL-BERT on the massive-scale Conceptual Captions
dataset, together with a text-only corpus. Extensive empirical analysis
demonstrates that the pre-training procedure can better align the
visual-linguistic clues and benefit the downstream tasks, such as visual
commonsense reasoning, visual question answering and referring expression
comprehension. Notably, VL-BERT achieved first place among single models
on the leaderboard of the VCR benchmark. Code is released at
https://github.com/jackroos/VL-BERT.
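
The input format described above, in which every element is either a word or an
RoI, can be pictured with a small sketch. It is an illustration only and is not
taken from the released code: the class names `WordElement` and `RoIElement`,
the function `build_input`, the `(x1, y1, x2, y2)` box format, and the choice to
place words before RoIs are all assumptions. Special tokens and the actual
embedding layers are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class WordElement:
    """A linguistic input element: one word from the input sentence."""
    token: str


@dataclass(frozen=True)
class RoIElement:
    """A visual input element: one region-of-interest from the input image."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


InputElement = Union[WordElement, RoIElement]


def build_input(
    words: Sequence[str],
    boxes: Sequence[Tuple[float, float, float, float]],
) -> List[InputElement]:
    """Build one mixed sequence of word and RoI elements.

    Assumption: words come first and RoIs follow; the abstract does not
    specify the ordering.
    """
    sequence: List[InputElement] = [WordElement(w) for w in words]
    sequence.extend(RoIElement(tuple(b)) for b in boxes)
    return sequence
```

For example, `build_input(["a", "dog"], [(0.0, 0.0, 10.0, 10.0)])` yields a
three-element sequence that a single Transformer backbone could attend over
jointly.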

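This work introduces Visual-Linguistic BERT (VL-BERT), a new pre-trainable generic representation for visual-linguistic tasks. It uses the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input. Pre-trained on the large-scale Conceptual Captions dataset together with a text-only corpus, VL-BERT is designed to fit most visual-linguistic downstream tasks and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension.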