Zero-shot audio captioning aims to automatically generate descriptive
text captions for audio content without prior training on this task.
Unlike speech recognition, which transcribes the spoken language in audio
into text, audio captioning is commonly concerned with ambient sounds or
sounds produced by a human performing an action. Inspired by
zero-shot image captioning methods, we propose ZerAuCap, a novel framework for
summarising such general audio signals in a text caption without requiring
task-specific training. In particular, our framework uses a pre-trained
large language model (LLM) to generate the text, guided by a pre-trained
audio-language model so that the captions describe the audio content.
Additionally, we use audio context keywords to prompt the language model
to generate text that is broadly relevant to sounds. Our proposed
framework achieves state-of-the-art results in zero-shot audio captioning on
the AudioCaps and Clotho datasets. Our code is available at
this https URL

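ZerAuCap is a new framework that uses a pre-trained large language model to generate text captions describing audio content without task-specific training. A pre-trained audio-language model guides the language model toward text that is relevant to the audio, and audio context keywords prompt it to produce text that is broadly relevant to sounds. The framework achieves state-of-the-art results on the AudioCaps and Clotho datasets.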
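
The guidance idea described above can be illustrated with a minimal Python sketch. It is one plausible reading of "a language model guided by an audio-language model", not the ZerAuCap implementation: the lookup-table language model, the tag-overlap audio score, the additive scoring rule, the `guidance_weight` parameter, and the fixed prompt (standing in for the audio context keywords) are all hypothetical stand-ins chosen so the example runs without any pre-trained model.

```python
import math

# Toy stand-in for the content of an audio clip (assumption, not real data).
AUDIO_TAGS = {"dog", "barking"}


def toy_lm_candidates(tokens):
    """Hypothetical language model: (token, log-prob) candidates given the last token."""
    table = {
        "of": [("a", math.log(0.7)), ("the", math.log(0.3))],
        "a": [("cat", math.log(0.5)), ("dog", math.log(0.4)), ("car", math.log(0.1))],
        "dog": [("sleeping", math.log(0.6)), ("barking", math.log(0.4))],
    }
    # Unknown contexts end the caption.
    return table.get(tokens[-1], [("<eos>", 0.0)])


def toy_audio_text_score(tokens):
    """Hypothetical audio-text match: count of distinct tokens that occur in the clip's tags."""
    return len(set(tokens) & AUDIO_TAGS)


def guided_caption(prompt, guidance_weight, max_steps=10):
    """Greedy decoding where each candidate is scored by LM log-prob plus weighted audio match."""
    tokens = prompt.split()
    for _ in range(max_steps):
        candidates = toy_lm_candidates(tokens)
        best_token, _ = max(
            candidates,
            key=lambda c: c[1] + guidance_weight * toy_audio_text_score(tokens + [c[0]]),
        )
        if best_token == "<eos>":
            break
        tokens.append(best_token)
    return " ".join(tokens)


if __name__ == "__main__":
    prompt = "this is a sound of"  # hypothetical sound-related prompt
    print(guided_caption(prompt, guidance_weight=0.0))  # language model alone
    print(guided_caption(prompt, guidance_weight=1.0))  # with audio guidance
```

With `guidance_weight=0.0` the toy language model follows its own most likely continuation; with a positive weight, candidates that match the audio are preferred at each step. No model is trained or fine-tuned in either case, which is the sense in which such a scheme is zero-shot.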