Audio events have a hierarchical structure in both time and frequency and
can be grouped together to form more abstract semantic audio classes. In
this work, we develop a multiscale audio spectrogram Transformer (MAST) that
employs hierarchical representation learning for efficient audio
classification. Specifically, MAST applies one-dimensional pooling along the
time domain and two-dimensional pooling along the time and frequency domains
in different stages, progressively reducing the number of tokens and increasing
the feature dimensions. MAST significantly outperforms AST~\cite{gong2021ast}
by 22.2\%, 4.4\% and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound,
respectively, in top-1 accuracy without external training data. On the
downloaded AudioSet dataset, in which over 20\% of the audio clips are missing,
MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x
more efficient in terms of multiply-accumulates (MACs) and has 42\% fewer
parameters than AST. Through clustering metrics and visualizations, we
demonstrate that the proposed MAST learns more semantically separable feature
representations from audio signals.
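
The staged pooling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `avg_pool` and `stage_shapes`, the constant `HYPOTHETICAL_STAGES`, the 64 x 8 starting token grid, the base dimension of 96, and the choice to double the feature dimension at each stage are all assumptions made for illustration. The sketch only shows how 1-D pooling (time only) and 2-D pooling (time and frequency) shrink the token count while the feature dimension grows.

```python
from typing import List, Tuple

Grid = List[List[float]]  # token grid indexed as grid[time][freq], one scalar per token


def avg_pool(grid: Grid, stride_t: int, stride_f: int) -> Grid:
    """Average-pool a token grid with non-overlapping windows.

    stride_f == 1 pools along time only (1-D pooling); stride_f > 1 pools
    along both time and frequency (2-D pooling).
    """
    out_t = len(grid) // stride_t
    out_f = len(grid[0]) // stride_f
    pooled = []
    for i in range(out_t):
        row = []
        for j in range(out_f):
            window = [
                grid[i * stride_t + a][j * stride_f + b]
                for a in range(stride_t)
                for b in range(stride_f)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def stage_shapes(
    time_tokens: int,
    freq_tokens: int,
    dim: int,
    stages: List[Tuple[int, int, int]],
) -> List[Tuple[int, int, int, int]]:
    """Track (time, freq, dim, num_tokens) before and after each stage.

    Each stage is (stride_t, stride_f, dim_multiplier).
    """
    shapes = [(time_tokens, freq_tokens, dim, time_tokens * freq_tokens)]
    for stride_t, stride_f, dim_mult in stages:
        time_tokens //= stride_t
        freq_tokens //= stride_f
        dim *= dim_mult
        shapes.append((time_tokens, freq_tokens, dim, time_tokens * freq_tokens))
    return shapes


# Hypothetical schedule: one time-only stage, then two time-and-frequency stages.
HYPOTHETICAL_STAGES = [(2, 1, 2), (2, 2, 2), (2, 2, 2)]
shapes = stage_shapes(64, 8, 96, HYPOTHETICAL_STAGES)
```

Under these assumed settings the token count falls from 512 to 16 while the feature dimension rises from 96 to 768, which is the trade-off the abstract describes; the actual stage configuration of MAST may differ.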

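This study proposes the Multiscale Audio Spectrogram Transformer (MAST), which uses hierarchical representation learning to make audio classification more efficient. Without external training data, MAST improves accuracy over AST by 22.2%, 4.4% and 4.7% on the Kinetics-Sounds, Epic-Kitchens-100 and VGGSound datasets, respectively, while also being more efficient than AST.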

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Recently, neural networks based purely on self-attention, such as the Vision
Transformer (ViT), have been shown to outperform deep learning models
constructed with convolutional neural networks (CNNs) on various vision tasks,
thus extending the success of Transformers, which were originally developed for
language processing, to the vision domain. A recent study showed that a similar
methodology can also be applied to the audio domain. Specifically, the Audio
Spectrogram Transformer (AST) achieves state-of-the-art results on various
audio classification benchmarks. However, pure Transformer models tend to
require more training data than CNNs, and the success of the AST relies
on supervised pretraining that requires a large amount of labeled data and a
complex training pipeline, thus limiting the practical usage of AST.
This paper focuses on audio and speech classification, and aims to reduce the
need for large amounts of labeled data for AST by leveraging self-supervised
learning using unlabeled data. Specifically, we propose to pretrain the AST
model with joint discriminative and generative masked spectrogram patch
modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We
evaluate our pretrained models on both audio and speech classification tasks
including audio event classification, keyword spotting, emotion recognition,
and speaker identification. The proposed self-supervised framework
significantly boosts AST performance on all tasks, with an average improvement
of 60.9%, leading to similar or even better results than a supervised
pretrained AST. To the best of our knowledge, this is the first patch-based
self-supervised learning framework in the audio and speech domain, and also the
first self-supervised learning framework for AST.
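
The joint objective mentioned above can be sketched as follows. The abstract does not give the loss formulas, so this sketch makes assumptions: the generative term is taken to be a mean squared error over masked patches, the discriminative term is taken to be an InfoNCE-style contrastive loss in which each prediction must identify its own patch among all masked patches, and the two are combined with a hypothetical `weight`. All function names are illustrative, and the patch predictions would come from the model's output heads, which are not shown.

```python
import math
import random
from typing import List, Sequence

Patch = Sequence[float]  # one flattened spectrogram patch


def choose_masked(num_patches: int, num_masked: int, seed: int = 0) -> List[int]:
    """Pick which patch positions to mask (uniformly at random, an assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), num_masked))


def generative_loss(pred: Sequence[Patch], target: Sequence[Patch]) -> float:
    """Mean squared error between reconstructed and original masked patches."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def discriminative_loss(pred: Sequence[Patch], target: Sequence[Patch]) -> float:
    """InfoNCE-style loss: prediction i should score target i above the others."""
    loss = 0.0
    for i, p in enumerate(pred):
        scores = [sum(a * b for a, b in zip(p, t)) for t in target]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        loss += log_denominator - scores[i]
    return loss / len(pred)


def joint_loss(
    pred_gen: Sequence[Patch],
    pred_disc: Sequence[Patch],
    target: Sequence[Patch],
    weight: float = 1.0,  # hypothetical balance between the two terms
) -> float:
    """Discriminative term plus a weighted generative term."""
    return discriminative_loss(pred_disc, target) + weight * generative_loss(pred_gen, target)
```

In this sketch `target` holds only the original contents of the masked positions returned by `choose_masked`, so both terms are computed on patches the model did not see.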

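This paper proposes a self-supervised pretraining method that uses unlabeled data: the AST model is pretrained with joint discriminative and generative masked spectrogram patch modeling, which significantly improves audio classification performance. It is the first patch-based self-supervised learning framework in the audio domain and the first exploration of a self-supervised learning framework for AST.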