In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start by importing the rich knowledge of image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp model to discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With this vision-language pre-training strategy, we set new state-of-the-art Im2Sp performance on two widely used benchmark databases, COCO and Flickr8k. We then further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, the number of bits required to store the image data drops to just 0.8% of that of the original image data. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning
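
To make the unit representation concrete, the following is a minimal sketch of vector quantization and of the resulting storage cost in bits. It is not the authors' implementation: the function names (`quantize`, `unit_bits`, `raw_image_bits`) are hypothetical, and the image size, unit grid, and codebook size in the example are illustrative assumptions chosen only to show how a ratio below 1% can arise, not settings taken from the paper.

```python
import math


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    The returned indices play the role of discrete "units".
    """
    units = []
    for vec in features:
        # Squared Euclidean distance from this vector to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        units.append(dists.index(min(dists)))
    return units


def unit_bits(num_units, codebook_size):
    """Bits needed to store num_units indices into a codebook of this size."""
    return num_units * math.ceil(math.log2(codebook_size))


def raw_image_bits(height, width, channels=3, bits_per_channel=8):
    """Bits needed to store an uncompressed image."""
    return height * width * channels * bits_per_channel


if __name__ == "__main__":
    # Toy quantization: two 2-D feature vectors, two codebook entries.
    print(quantize([[0.1, 0.0], [0.9, 1.0]], [[0.0, 0.0], [1.0, 1.0]]))

    # Assumed, illustrative setting: a 256x256 RGB image represented by a
    # 32x32 grid of units drawn from a codebook of 8192 entries.
    units = unit_bits(32 * 32, 8192)
    raw = raw_image_bits(256, 256)
    print(f"units: {units} bits, raw: {raw} bits, ratio: {100 * units / raw:.2f}%")
```

Under these assumed values, each unit costs 13 bits because the codebook has 2^13 entries, so the storage ratio depends only on the codebook size and on how many pixels each unit covers.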

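This paper proposes a method for building a powerful and efficient image-to-speech captioning (Im2Sp) model. It imports knowledge from a large-scale pre-trained vision-language model and sets the model output to discretized speech units, i.e., the quantized speech features of a self-supervised speech model, so that the language modeling capability of the pre-trained vision-language model carries over into the spoken language modeling of Im2Sp. The approach achieves new state-of-the-art Im2Sp performance on the widely used benchmark databases COCO and Flickr8k, and further improves the efficiency of the Im2Sp model.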

Practical and Efficient Image-to-Speech Captioning Based on Vision-Language Pre-training and Multi-modal Tokens