Image captioning aims at generating descriptive and meaningful textual
descriptions of images, enabling a broad range of vision-language applications.
Prior works have demonstrated that harnessing the power of Contrastive Image
Language Pre-training (CLIP) offers a promising approach to achieving zero-shot
captioning, eliminating the need for expensive caption annotations. However,
the widely observed modality gap in the latent space of CLIP harms the
performance of zero-shot captioning by breaking the alignment between paired
image-text features. To address this issue, we conduct an analysis on the CLIP
latent space which leads to two findings. Firstly, we observe that the CLIP's
visual feature of image subregions can achieve closer proximity to the paired
caption due to the inherent information loss in text descriptions. In addition,
we show that the modality gap between a paired image-text can be empirically
modeled as a zero-mean Gaussian distribution. Motivated by the findings, we
propose a novel zero-shot image captioning framework with text-only training to
reduce the modality gap. In particular, we introduce a subregion feature
aggregation to leverage local region information, which produces a compact
visual representation for matching text representation. Moreover, we
incorporate a noise injection and CLIP reranking strategy to boost captioning
performance. We also extend our framework to build a zero-shot VQA pipeline,
demonstrating its generality. Through extensive experiments on common
captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that
our method achieves remarkable performance improvements. Code is available at
this https URL

通过减少视觉和文本之间的模态差异，我们提出了一种零摄影机图片字幕框架，通过仅使用文本进行训练和引入局部图像区域特征聚合、噪声注入和 CLIP 排序策略来提高字幕性能，并证明其在 MSCOCO、Flickr30k 和 VQAV2 等数据集上具有显著的性能提升。

通过仅文本训练挖掘细粒度的图像 - 文本对齐用于零样本字幕生成

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via  Text-Only Training

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong
zero-shot transfer capability in many discriminative tasks. Their adaptation to
zero-shot image-conditioned text generation tasks has drawn increasing
interest. Prior arts approach to zero-shot captioning by either utilizing the
existing large language models (e.g., GPT-2) or pre-training the
encoder-decoder network in an end-to-end manner. In this work, we propose a
simple framework, named DeCap, for zero-shot captioning. We introduce a
lightweight visual-aware language decoder. This decoder is both data-efficient
and computation-efficient: 1) it only requires the text data for training,
easing the burden on the collection of paired data. 2) it does not require
end-to-end training. When trained with text-only data, the decoder takes the
text embedding extracted from the off-the-shelf CLIP encoder as a prefix
embedding. The challenge is that the decoder is trained on the text corpus but
at the inference stage, it needs to generate captions based on visual inputs.
The modality gap issue is widely observed in multi-modal contrastive models
that prevents us from directly taking the visual embedding as the prefix
embedding. We propose a training-free mechanism to reduce the modality gap. We
project the visual embedding into the CLIP text embedding space, while the
projected embedding retains the information of the visual input. Taking the
projected embedding as the prefix embedding, the decoder generates high-quality
descriptions that match the visual input. The experiments show that DeCap
outperforms other zero-shot captioning methods and unpaired captioning methods
on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps.

该论文提出了一种名为 DeCap 的简单框架来解决零 - shot 图片描述问题，通过引入轻量级的视觉感知语言解码器来满足对数据和计算效率的要求，并提出了一个训练 - free 机制来减少模态间差异。实验证明，DeCap 在典型的图像说明基准测试中表现优异。

DeCap：通过纯文本训练对 CLIP 潜变量进行解码，实现零样本描述

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

When trained on large-scale datasets, image captioning models can understand
the content of images from a general domain but often fail to generate
accurate, detailed captions. To improve performance, pretraining-and-finetuning
has been a key strategy for image captioning. However, we find that large-scale
bidirectional training between image and text enables zero-shot image
captioning. In this paper, we introduce Bidirectional Image Text Training in
largER Scale, BITTERS, an efficient training and inference framework for
zero-shot image captioning. We also propose a new evaluation benchmark which
comprises of high quality datasets and an extensive set of metrics to properly
evaluate zero-shot captioning accuracy and societal bias. We additionally
provide an efficient finetuning approach for keyword extraction. We show that
careful selection of large-scale training set and model architecture is the key
to achieving zero-shot image captioning.

本文介绍了一种名为 BITTERS 的零 - shot 图像描述框架及数据集评估方法，通过双向图像文本训练以及精细调整提高图像描述精度。