In this paper we present the first model for directly synthesizing fluent,
natural-sounding spoken audio captions for images that does not require natural
language text as an intermediate representation or source of supervision.
Instead, we connect the image captioning module and the speech synthesis module
with a set of discrete, sub-word speech units that are discovered with a
self-supervised visual grounding task. We conduct experiments on the Flickr8k
spoken caption dataset in addition to a novel corpus of spoken audio captions
collected for the popular MSCOCO dataset, demonstrating that our generated
captions also capture diverse visual semantics of the images they describe. We
investigate several different intermediate speech representations, and
empirically find that a representation must satisfy several important
properties to serve as a drop-in replacement for text.

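This study proposes a model that directly synthesizes fluent, natural-sounding
spoken audio captions for images. The model does not require natural language
text as an intermediate representation or source of supervision; instead, it
connects the image captioning module and the speech synthesis module through a
set of discrete, sub-word speech units discovered with a self-supervised visual
grounding task. The researchers conducted experiments on the Flickr8k spoken
caption dataset and collected a new corpus of spoken audio captions for the
popular MSCOCO dataset, demonstrating that the generated spoken captions also
capture the diverse visual semantics of the images they describe. They
investigated several different intermediate speech representations and showed
experimentally that a representation must satisfy several important properties
to serve as a replacement for text.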