Multilingual image captioning has recently been tackled by training with
large-scale machine-translated data, which is an expensive, noisy, and
time-consuming process. Without requiring any multilingual caption data, we
propose LMCap, an image-blind few-shot multilingual captioning model that works
by prompting a language model with retrieved captions. Specifically, instead of
following the standard encoder-decoder paradigm, given an image, LMCap first
retrieves the captions of similar images using a multilingual CLIP encoder.
These captions are then combined into a prompt for an XGLM decoder, in order to
generate captions in the desired language. In other words, the generation model
does not directly process the image, instead processing retrieved captions.
Experiments on the XM3600 dataset of geographically diverse images show that
our model is competitive with fully supervised multilingual captioning models,
without requiring any supervised training on any captioning data.

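This work proposes LMCap, a retrieval-based model that performs few-shot
multilingual image captioning without requiring any multilingual caption data.
It uses a multilingual CLIP encoder to retrieve the captions of similar images,
then combines these captions into a prompt for an XGLM decoder to generate a
caption in the desired language. Experiments show that the model is competitive
with fully supervised multilingual captioning models without any supervised
training on captioning data.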
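
The retrieve-then-prompt idea described above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration: the
three-dimensional vectors stand in for multilingual CLIP embeddings, the
datastore is a toy list, the helper names (retrieve_captions, build_prompt) are
hypothetical, and the prompt wording is not the template used by LMCap. The
few-shot demonstrations and the call to the XGLM decoder are omitted; the sketch
only shows how the text prompt could be assembled without the language model
ever seeing the image.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(image_vec, datastore, k=2):
    """Return the captions of the k datastore entries most similar to image_vec.

    datastore is a list of (embedding, caption) pairs. In a real system the
    embeddings would come from a multilingual CLIP encoder; here they are
    hand-written toy vectors (assumption).
    """
    ranked = sorted(datastore, key=lambda item: cosine(image_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

def build_prompt(captions, language):
    """Assemble a text-only prompt from retrieved captions.

    The wording is a hypothetical template, not the one used in the paper.
    """
    lines = ["Similar images have the following captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append(f"A caption I can generate to describe this image in {language} is:")
    return "\n".join(lines)

# Toy datastore of (embedding, caption) pairs (assumption).
DATASTORE = [
    ([0.9, 0.1, 0.0], "a dog running on a beach"),
    ([0.8, 0.2, 0.1], "a puppy playing in the sand"),
    ([0.0, 0.1, 0.9], "a plate of pasta on a table"),
]

query = [1.0, 0.0, 0.0]  # stand-in for the embedding of the input image
retrieved = retrieve_captions(query, DATASTORE, k=2)
prompt = build_prompt(retrieved, "Spanish")
print(prompt)
```

The resulting string would then be passed to a multilingual language model,
which continues the final line with a caption in the requested language.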