We propose Fast Language-Audio Pre-training (FLAP), a self-supervised
approach that efficiently and effectively learns aligned audio and language
representations through masking, contrastive learning and reconstruction. For
efficiency, FLAP randomly drops audio spectrogram tokens, focusing solely on
the remaining ones for self-supervision. Through inter-modal contrastive
learning, FLAP learns to align paired audio and text representations in a
shared latent space. Notably, FLAP leverages multiple augmented views via
masking for inter-modal contrast and learns to reconstruct the masked portion
of audio tokens. Moreover, FLAP leverages large language models (LLMs) to
augment the text inputs, which further improves performance. Together, these
techniques yield more robust and informative audio-text representations,
enabling FLAP to achieve state-of-the-art (SoTA) audio-text retrieval
performance on AudioCaps (53.0% R@1) and Clotho (25.5% R@1).
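
The two core ingredients described above, random dropping of audio tokens and
inter-modal contrastive alignment, can be illustrated with a minimal NumPy
sketch. This is not the FLAP implementation: the function names `drop_tokens`
and `contrastive_loss`, the keep ratio, and the temperature value are
assumptions made for illustration, and the encoders and the reconstruction
objective are omitted.

```python
import numpy as np


def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a subset of tokens for each example.

    tokens: array of shape (batch, num_tokens, dim), e.g. spectrogram patches.
    Returns the kept tokens and the (sorted) indices that were kept.
    keep_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    batch, num_tokens, _ = tokens.shape
    num_keep = max(1, int(round(num_tokens * keep_ratio)))
    # Draw independent random scores per example and keep the lowest num_keep.
    noise = rng.random((batch, num_tokens))
    keep_idx = np.sort(np.argsort(noise, axis=1)[:, :num_keep], axis=1)
    kept = np.take_along_axis(tokens, keep_idx[:, :, None], axis=1)
    return kept, keep_idx


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature default is an illustrative assumption.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    diag = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)
```

In such a setup, only the kept tokens would be passed to the audio encoder,
which is where the efficiency gain comes from; calling `drop_tokens` twice on
the same batch gives two differently masked views of each clip.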

FLAP: Fast Language-Audio Pre-training

Recent advances in using language models to obtain cross-modal audio-text
representations have overcome the limitations of conventional training
approaches that use predefined labels. This has allowed the community to make
progress in tasks like zero-shot classification, which would otherwise not be
possible. However, learning such representations requires a large amount of
human-annotated audio-text pairs. In this paper, we study unsupervised
approaches to improve the learning framework of such representations with
unpaired text and audio. We explore domain-unspecific and domain-specific
curation methods to create audio-text pairs that we use to further improve the
model. We also show that combining domain-specific curation with a
soft-labeled contrastive loss yields significant improvements in zero-shot
classification performance on downstream sound event classification and
acoustic scene classification tasks.
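
To make the idea of a soft-labeled contrastive loss concrete, the sketch below
replaces the usual one-hot pairing targets with a soft target distribution.
The abstract does not specify how the soft labels are constructed, so the
helper `soft_targets_from_text` (blending the one-hot pairing with text-text
similarity), the `mix` weight, and both temperature values are hypothetical
choices for illustration only.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def soft_targets_from_text(text_emb, temperature=0.1, mix=0.5):
    """Hypothetical soft labels: blend one-hot pairs with text-text similarity.

    Each row is a probability distribution over the batch (rows sum to 1).
    """
    t = l2_normalize(text_emb)
    soft = np.exp(log_softmax(t @ t.T / temperature, axis=1))
    return (1.0 - mix) * np.eye(len(t)) + mix * soft


def soft_contrastive_loss(audio_emb, text_emb, targets, temperature=0.07):
    """Cross-entropy between soft targets and audio-text similarity logits.

    With targets equal to the identity matrix this reduces to the standard
    symmetric contrastive loss with hard (one-hot) labels.
    """
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature
    # Audio-to-text: row i of targets is the distribution over texts.
    loss_a2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    # Text-to-audio: the same soft distribution is reused over audio clips.
    loss_t2a = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

Soft targets of this kind let a curated pair contribute partial credit to
semantically similar captions in the batch, rather than treating every
non-matching caption as an equally wrong negative.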

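This paper studies unsupervised learning with unpaired data. Combining
domain-specific curation with a soft-labeled contrastive loss significantly
improves cross-modal audio-text representation learning and its performance on
zero-shot classification tasks.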