Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong
zero-shot transfer capability in many discriminative tasks. Their adaptation to
zero-shot image-conditioned text generation tasks has drawn increasing
interest. Prior work approaches zero-shot captioning either by utilizing
existing large language models (e.g., GPT-2) or by pre-training an
encoder-decoder network in an end-to-end manner. In this work, we propose a
simple framework, named DeCap, for zero-shot captioning. We introduce a
lightweight visual-aware language decoder. This decoder is both data-efficient
and computation-efficient: 1) it requires only text data for training,
easing the burden of collecting paired data; 2) it does not require
end-to-end training. When trained with text-only data, the decoder takes the
text embedding extracted from the off-the-shelf CLIP encoder as a prefix
embedding. The challenge is that the decoder is trained on a text corpus, but
at the inference stage it must generate captions based on visual inputs.
The modality gap that is widely observed in multi-modal contrastive models
prevents us from directly taking the visual embedding as the prefix
embedding. We propose a training-free mechanism to reduce the modality gap. We
project the visual embedding into the CLIP text embedding space, while the
projected embedding retains the information of the visual input. Taking the
projected embedding as the prefix embedding, the decoder generates high-quality
descriptions that match the visual input. Experiments show that DeCap
outperforms other zero-shot captioning methods and unpaired captioning methods
on standard image captioning benchmarks, namely MSCOCO and NoCaps.
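
The abstract does not spell out how the visual embedding is projected into the CLIP text embedding space. As a minimal sketch of one training-free option, the code below assumes a stored set of text embeddings and replaces the image embedding with a similarity-weighted average of them, so the result lies in the span of text embeddings while still depending on the image. The names `project_to_text_space`, `text_memory`, and `temperature`, and the random vectors standing in for CLIP embeddings, are hypothetical and used only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def project_to_text_space(image_emb, text_memory, temperature=0.01):
    """Map an image embedding onto a weighted average of text embeddings.

    Assumption: image_emb has shape (D,) and text_memory has shape (N, D),
    both produced by the same contrastive model. No parameters are trained.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    text_memory = l2_normalize(np.asarray(text_memory, dtype=np.float64))

    # Cosine similarity between the image and every stored text embedding.
    sims = text_memory @ image_emb                # shape (N,)

    # Softmax over similarities (shifted by the max for numerical stability).
    logits = sims / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # The projected embedding is a convex combination of text embeddings.
    projected = weights @ text_memory             # shape (D,)
    return l2_normalize(projected)


# Toy usage: random vectors stand in for real CLIP embeddings.
rng = np.random.default_rng(0)
text_memory = rng.normal(size=(5, 8))
image_emb = text_memory[2] + 0.1 * rng.normal(size=8)  # image "close to" text 2
prefix = project_to_text_space(image_emb, text_memory)
```

In this sketch `prefix` would play the role of the prefix embedding passed to the decoder. A small `temperature` concentrates the weights on the most similar text embeddings, while a larger one blends more of them.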

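This paper proposes DeCap, a simple framework for zero-shot image captioning. It introduces a lightweight visual-aware language decoder to meet data-efficiency and computation-efficiency requirements, and proposes a training-free mechanism to reduce the modality gap. Experiments show that DeCap performs strongly on typical image captioning benchmarks.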