Differences between the data seen at training time and at application time remain a major challenge for machine learning. We study this problem in the context of Acoustic Scene Classification (ASC) with mismatching recording devices. Previous works successfully employed frequency-wise normalization of inputs and hidden layer activations in convolutional neural networks to reduce the recording device discrepancy. The main objective of this work is to adopt frequency-wise normalization for Audio Spectrogram Transformers (ASTs), which have recently become the dominant model architecture in ASC. To this end, we first investigate how recording device characteristics are encoded in the hidden layer activations of ASTs. We find that recording device information is initially encoded in the frequency dimension; however, after the first self-attention block, it is largely transformed into the token dimension. Based on this observation, we conjecture that suppressing recording device characteristics in the input spectrogram is the most effective approach. We propose a frequency-centering operation for spectrograms that improves ASC performance on unseen recording devices by up to 18.2 percentage points on average.

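This paper addresses a major problem in machine learning, namely differences between the data seen at training time and at application time, and studies it for acoustic scene classification with mismatched recording devices. Frequency-wise normalization of inputs and of hidden layer activations in convolutional neural networks has been shown to reduce the discrepancy between recording devices. The main objective of this work is to apply this approach to Audio Spectrogram Transformers, which have become the dominant model in acoustic scene classification, and to examine how recording device characteristics are encoded in the hidden layer activations of this model. Based on this observation, we infer that suppressing recording device characteristics in the input spectrogram is the most effective way to remove them. We propose a frequency-centering operation for spectrograms that improves ASC performance on unseen recording devices by up to 18.2 percentage points on average.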
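The abstract does not spell out the frequency-centering operation, so the following is only a minimal sketch under stated assumptions: centering is taken to mean subtracting, from each frequency bin of a log-magnitude spectrogram, that bin's mean over time. The device model used for illustration (a fixed additive offset per frequency bin in the log domain) and the names `frequency_center` and `apply_device_offset` are likewise hypothetical and are not taken from the paper.

```python
from statistics import fmean


def frequency_center(spec):
    """Subtract each frequency bin's mean over time from that bin.

    Assumption: `spec` is a log-magnitude spectrogram stored as a list of
    rows, one row per frequency bin, each row holding the values over time.
    """
    centered = []
    for row in spec:
        bin_mean = fmean(row)  # average of this frequency bin over time
        centered.append([value - bin_mean for value in row])
    return centered


def apply_device_offset(spec, offsets):
    """Illustrative device model (an assumption, not the paper's):
    a time-invariant frequency response becomes a fixed additive
    offset per frequency bin in the log domain."""
    return [[value + off for value in row] for row, off in zip(spec, offsets)]


# Toy log-spectrogram: 3 frequency bins x 4 time frames.
clean = [
    [0.0, 1.0, 2.0, 1.0],
    [3.0, 3.0, 5.0, 5.0],
    [-1.0, 0.0, -1.0, 0.0],
]

# The same scene "recorded" by two hypothetical devices.
device_a = apply_device_offset(clean, [0.5, -2.0, 1.5])
device_b = apply_device_offset(clean, [-1.0, 4.0, 0.25])

centered_a = frequency_center(device_a)
centered_b = frequency_center(device_b)
```

Under this simplified device model, `centered_a` and `centered_b` coincide, because a constant per-bin offset is removed together with the per-bin mean. Real devices are not purely additive in this sense, so the sketch only illustrates why suppressing device characteristics at the input spectrogram is plausible. With an array library, the same operation would typically be a mean over the time axis followed by a broadcast subtraction.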

Improving Recording Device Generalization in Audio Spectrogram Transformers via Frequency-wise Normalization