Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.

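This paper proposes a self-supervised audio representation learning method and applies it to a variety of non-speech audio tasks. The self-supervised pre-training can reduce the need for labeled data by two-thirds, and on the AudioSet benchmark it achieves a mean average precision (mAP) score of 0.415 through audio-only self-supervised learning. On several downstream tasks, our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way.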

Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks