Deep neural networks have been applied to audio spectrograms for respiratory
sound classification. Existing models often treat the spectrogram as a
synthetic image while overlooking its physical characteristics. In this paper,
a Multi-View Spectrogram Transformer (MVST) is proposed to embed different
views of time-frequency characteristics into the vision transformer.
Specifically, the proposed MVST splits the mel-spectrogram into differently sized
patches, representing the multi-view acoustic elements of a respiratory sound.
These patches and positional embeddings are then fed into transformer encoders
to extract the attentional information among patches through a self-attention
mechanism. Finally, a gated fusion scheme is designed to automatically weigh
the multi-view features to highlight the best one in a specific scenario.
Experimental results on the ICBHI dataset demonstrate that the proposed MVST
significantly outperforms state-of-the-art methods for classifying respiratory
sounds.
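
The patch splitting and gated fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch shapes in `view_sizes`, the mean-pooling stand-in for the transformer encoders, and the softmax gate with random weights are all assumptions made for illustration.

```python
import numpy as np

def split_into_patches(spec, patch_h, patch_w):
    """Split a (freq, time) spectrogram into non-overlapping, flattened patches."""
    n_f, n_t = spec.shape
    # Crop so both axes divide evenly (an assumption; padding would also work).
    n_f, n_t = n_f - n_f % patch_h, n_t - n_t % patch_w
    spec = spec[:n_f, :n_t]
    patches = spec.reshape(n_f // patch_h, patch_h, n_t // patch_w, patch_w)
    patches = patches.transpose(0, 2, 1, 3)  # (freq_idx, time_idx, patch_h, patch_w)
    return patches.reshape(-1, patch_h * patch_w)

def gated_fusion(view_features, gate_weights):
    """Softmax-gated weighted sum of per-view feature vectors.

    view_features: (n_views, dim); gate_weights: (n_views * dim, n_views).
    """
    logits = view_features.reshape(-1) @ gate_weights
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    fused = gates @ view_features  # (dim,)
    return fused, gates

rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 256))      # placeholder mel-spectrogram: 128 bins x 256 frames
view_sizes = [(16, 16), (32, 8), (8, 32)]  # hypothetical patch shapes, one per view

views = [split_into_patches(mel, h, w) for h, w in view_sizes]

# Stand-in for the per-view transformer encoders: mean-pool the patches of each view.
view_features = np.stack([v.mean(axis=0) for v in views])        # (3, 256)
gate_weights = 0.01 * rng.standard_normal((view_features.size, len(views)))
fused, gates = gated_fusion(view_features, gate_weights)
```

Tall patches such as (32, 8) cover a wide frequency range over a short time span, while wide patches such as (8, 32) do the opposite, which is one plausible reading of "multi-view" here. The gate assigns each view a weight that sums to one across views.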

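This work proposes the Multi-View Spectrogram Transformer (MVST), which applies deep neural networks to the classification of respiratory sound spectrograms. The model splits the mel-spectrogram into patches of different sizes, uses transformer encoders to extract attention information among the patches, and introduces a gated fusion mechanism to strengthen the multi-view features. It clearly outperforms existing state-of-the-art methods on respiratory sound classification.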

Multi-View Spectrogram Transformer for Respiratory Sound Classification

This paper studies a simple extension of image-based Masked Autoencoders
(MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first
encodes audio spectrogram patches with a high masking ratio, feeding only the
non-masked tokens through encoder layers. The decoder then re-orders and
decodes the encoded context padded with mask tokens, in order to reconstruct
the input spectrogram. We find it beneficial to incorporate local window
attention in the decoder, as audio spectrograms are highly correlated in local
time and frequency bands. We then fine-tune the encoder with a lower masking
ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art
performance on six audio and speech classification tasks, outperforming other
recent models that use external supervised pre-training. The code and models
will be at this https URL.
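
The masking step, and the re-ordering with mask tokens before decoding, can be sketched as follows. This is a minimal NumPy illustration rather than the released Audio-MAE code: the 0.8 mask ratio, the token count and width, the all-zero mask token, and the identity stand-in for the encoder are assumptions.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens; return them with their original indices."""
    n_tokens = tokens.shape[0]
    n_keep = int(round(n_tokens * (1 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n_tokens)[:n_keep])
    return tokens[keep_idx], keep_idx

def restore_with_mask_tokens(encoded, keep_idx, n_total, mask_token):
    """Place encoded tokens back at their original positions, padding with mask tokens."""
    full = np.tile(mask_token, (n_total, 1))
    full[keep_idx] = encoded
    return full

rng = np.random.default_rng(0)
tokens = rng.standard_normal((512, 768))  # placeholder: 512 spectrogram patch embeddings

# Only the visible (non-masked) tokens would pass through the encoder layers.
visible, keep_idx = random_masking(tokens, mask_ratio=0.8, rng=rng)
encoded = visible                         # identity stand-in for the Transformer encoder

mask_token = np.zeros(768)                # a learned vector in practice; zeros here
decoder_input = restore_with_mask_tokens(encoded, keep_idx, 512, mask_token)
```

Because the encoder only sees the visible tokens, its cost scales with the kept fraction of the sequence, while the decoder operates on the full-length sequence.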

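This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms and proposes the Audio-MAE model. Following a Transformer encoder-decoder design, the model encodes audio spectrograms with a high masking ratio and feeds only the non-masked tokens through the encoder layers; the decoder then re-orders and decodes the context produced by the encoder to reconstruct the input spectrogram. On six audio and speech classification tasks, Audio-MAE achieves state-of-the-art performance, surpassing other recent models that use external supervised pre-training.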

Masked Autoencoders that Listen

The first step in any voice recognition software is to determine what
language a speaker is using, and ideally this process would be automated. The
technique described in this paper, language identification for audio
spectrograms (LIFAS), feeds spectrograms generated from audio signals into a
convolutional neural network (CNN) that identifies the language.
LIFAS requires minimal pre-processing of the audio signals, because the
spectrograms are generated on the fly for each batch as it is fed to the
network during training.
LIFAS takes deep learning tools that have proven successful on image
processing tasks and applies them to audio signal classification. LIFAS performs
binary language classification with an accuracy of 97%, and multi-class
classification with six languages at an accuracy of 89%, on 3.75-second audio
clips.
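
The idea of generating spectrograms per batch during training, rather than in a separate pre-processing pass, can be sketched as below. This is an illustrative NumPy version and not the LIFAS code: the 16 kHz sample rate, the STFT parameters, the log-magnitude scaling, and the random placeholder clips and labels are assumptions.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude STFT spectrogram of a 1-D signal, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T

def batch_generator(clips, labels, batch_size, rng):
    """Yield (spectrograms, labels), computing spectrograms only for the current batch."""
    order = rng.permutation(len(clips))
    for start in range(0, len(order), batch_size):
        idx = order[start : start + batch_size]
        specs = np.stack([spectrogram(clips[i]) for i in idx])
        yield specs, labels[idx]

sample_rate = 16000                      # assumed; not stated in the abstract
clip_len = int(3.75 * sample_rate)       # 3.75-second clips
rng = np.random.default_rng(0)
clips = rng.standard_normal((10, clip_len))   # placeholder raw audio
labels = rng.integers(0, 6, size=10)          # placeholder labels for six languages

batches = list(batch_generator(clips, labels, batch_size=4, rng=rng))
```

Only raw audio needs to be stored on disk in this arrangement; the trade-off is that the spectrograms are recomputed every epoch.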

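This paper presents language identification for audio spectrograms (LIFAS), a technique that uses a convolutional neural network for language identification. It takes spectrograms generated from audio signals as input and classifies the language, reaching 97% accuracy on binary language classification and 89% accuracy on multi-class classification with six languages.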