While deep-learning models have been shown to perform well on image-to-text
datasets, it is difficult to use them in practice for captioning images. This
is because \textit{captions} tend to be context-dependent and
offer complementary information about an image, while models tend to produce
\textit{descriptions} that describe the visual features of the image. Prior
research in caption generation has explored the use of models that generate
captions when provided with the images alongside their respective descriptions
or contexts. We propose and evaluate a new approach, which leverages existing
large language models to generate captions from textual descriptions and
context alone, without ever processing the image directly. We demonstrate that,
after fine-tuning, our approach outperforms current state-of-the-art image-text
alignment models such as OSCAR-VinVL on this task, as measured by the CIDEr metric.

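This paper proposes a new method that uses large language models to generate image captions from textual descriptions and context, without processing the image directly. After fine-tuning, the method outperforms current state-of-the-art image-text alignment models on the CIDEr metric, addressing some of the difficulties encountered when using deep-learning models for image captioning.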