This paper addresses self-supervised learning of general-purpose audio representations. We explore Joint-Embedding Predictive Architectures (JEPA) for this task: an input mel-spectrogram is split into two parts (context and target), neural representations are computed for each, and the network is trained to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments, evaluating our models on a range of audio classification benchmarks that cover environmental sound, speech, and music downstream tasks. In particular, we focus on which part of the input is used as context and which as target, and show experimentally that this choice significantly affects model quality. Notably, some design choices that are effective in the image domain lead to poor performance on audio, which highlights major differences between the two modalities.
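
To make the training objective described above concrete, the following is a minimal NumPy sketch of a single JEPA-style loss computation on a mel-spectrogram. It is an illustration under stated assumptions and not the implementation used in the paper: the patch size, the uniformly random context/target split, the single-layer `tanh` encoders, the mean-pooled predictor, the squared-error loss, and all names (`patchify`, `split_context_target`, `init_params`, `jepa_loss`) are hypothetical choices made only to keep the example short and runnable.

```python
import numpy as np

def patchify(mel, patch_f, patch_t):
    """Cut a (n_mels, n_frames) spectrogram into flattened, non-overlapping patches."""
    n_mels, n_frames = mel.shape
    gf, gt = n_mels // patch_f, n_frames // patch_t
    mel = mel[: gf * patch_f, : gt * patch_t]  # drop any incomplete border patches
    grid = mel.reshape(gf, patch_f, gt, patch_t).transpose(0, 2, 1, 3)
    return grid.reshape(gf * gt, patch_f * patch_t)

def split_context_target(n_patches, target_ratio, rng):
    """Assumed strategy: a uniformly random partition of patch indices."""
    perm = rng.permutation(n_patches)
    n_target = int(round(target_ratio * n_patches))
    return np.sort(perm[n_target:]), np.sort(perm[:n_target])

def init_params(patch_dim, embed_dim, n_patches, rng):
    """Toy parameters; a real model would use deep encoders instead."""
    scale = 1.0 / np.sqrt(patch_dim)
    return {
        "context_encoder": rng.standard_normal((patch_dim, embed_dim)) * scale,
        "target_encoder": rng.standard_normal((patch_dim, embed_dim)) * scale,
        "pos_embed": rng.standard_normal((n_patches, embed_dim)) * 0.02,
        "predictor": rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim),
    }

def jepa_loss(mel, params, rng, patch=(16, 16), target_ratio=0.5):
    """Predict target representations from context representations."""
    patches = patchify(mel, *patch)
    ctx_idx, tgt_idx = split_context_target(len(patches), target_ratio, rng)

    # Representations of the two parts (toy single-layer encoders).
    ctx_repr = np.tanh(patches[ctx_idx] @ params["context_encoder"])
    tgt_repr = np.tanh(patches[tgt_idx] @ params["target_encoder"])

    # Toy predictor: pooled context plus the position of each target patch.
    pooled = ctx_repr.mean(axis=0)
    pred = (pooled + params["pos_embed"][tgt_idx]) @ params["predictor"]

    # Assumed loss: mean squared error in representation space.
    return np.mean((pred - tgt_repr) ** 2)

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 200))  # stand-in for a log-mel spectrogram
params = init_params(patch_dim=16 * 16, embed_dim=32, n_patches=60, rng=rng)
print("loss:", jepa_loss(mel, params, rng))
```

In this sketch, `split_context_target` is the piece that corresponds to the design choice the abstract highlights: swapping in a different rule for which patches become context and which become target changes what the predictor is asked to infer, while the rest of the pipeline stays the same.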

Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning