Speech carries clinically relevant information about some diseases and
therefore has potential for use in health assessment. Recent work shows growing
interest in applying deep learning algorithms, especially large pretrained
speech models, to Automatic Speech Assessment. One question that has not been
explored is how these models arrive at their outputs from their inputs. In this
work, we train and compare two configurations of the Audio Spectrogram
Transformer for Voice Disorder Detection and apply the attention rollout method
to produce model relevance maps, that is, the computed relevance of each
spectrogram region to the model's prediction. We use these maps to analyse how
models make predictions under different conditions, and show that the spread of
attention is reduced as a model is finetuned and that the model's attention
concentrates on specific phoneme regions.
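
As a rough illustration of the attention rollout idea mentioned above, the sketch below multiplies head-averaged attention matrices across layers while mixing in the identity to account for residual connections. It is a minimal NumPy sketch and not the authors' implementation: the function names, the equal residual weighting, the single class token at index 0, and the frequency-major patch ordering are all assumptions made for the example.

```python
import numpy as np


def attention_rollout(attentions, residual_weight=0.5):
    """Combine per-layer attention into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row is a softmax distribution (sums to 1).
    residual_weight: assumed share given to the identity (skip connection).
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Average over heads, then mix with the identity for the residual path.
        mixed = (1.0 - residual_weight) * layer_attention.mean(axis=0)
        mixed = mixed + residual_weight * identity
        # Re-normalise rows so each remains a distribution over input tokens.
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        # Later layers are applied on top of the earlier ones.
        rollout = mixed @ rollout
    return rollout


def relevance_map(rollout, n_freq, n_time):
    """Reshape the class-token row into a (frequency, time) patch grid.

    Assumes token 0 is the only special token and that the remaining tokens
    are spectrogram patches stored in frequency-major order.
    """
    relevance = rollout[0, 1:]
    relevance = relevance / relevance.sum()
    return relevance.reshape(n_freq, n_time)
```

With a real model, the per-layer attention arrays would come from the transformer itself, and the resulting grid could be upsampled and overlaid on the input spectrogram.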

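We train and compare two configurations of the Audio Spectrogram Transformer for voice disorder detection and apply the attention rollout method to generate model relevance maps. These maps are used to analyse how the models make predictions under different conditions, showing that as a model is finetuned the spread of attention decreases and attention concentrates on specific phoneme regions.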

Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Current state-of-the-art audio analysis systems rely on pre-trained embedding
models, often used off-the-shelf as (frozen) feature extractors. Choosing the
best one for a set of tasks is the subject of many recent publications.
However, one aspect often overlooked in these works is the influence of the
duration of the audio input used to extract an embedding, which we refer to
as Temporal Support (TS). In this work, we study the influence of the TS for
well-established or emerging pre-trained embeddings, chosen to represent
different types of architectures and learning paradigms. We conduct this
evaluation using both musical instrument and environmental sound datasets,
namely OpenMIC, TAU Urban Acoustic Scenes 2020 Mobile, and ESC-50. We
especially highlight that Audio Spectrogram Transformer-based systems (PaSST
and BEATs) remain effective with smaller TS, which therefore allows for a
drastic reduction in memory and computational cost. Moreover, we show that by
choosing the optimal TS we reach competitive results across all tasks. In
particular, we improve the state-of-the-art results on OpenMIC, using BEATs and
PaSST without any fine-tuning.
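
To make the notion of Temporal Support concrete, the sketch below cuts a waveform into windows of a chosen duration, embeds each window with a frozen extractor, and averages the window embeddings into one clip-level vector. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for a pre-trained model such as PaSST or BEATs, and the non-overlapping windows, zero padding and mean pooling are choices made for the example rather than details taken from the abstract.

```python
import numpy as np


def split_into_windows(waveform, sample_rate, ts_seconds):
    """Cut a 1-D waveform into non-overlapping windows of ts_seconds each."""
    window = int(round(ts_seconds * sample_rate))
    n_windows = len(waveform) // window
    if n_windows == 0:
        # Clip shorter than one window: zero-pad to a single full window.
        padded = np.pad(waveform, (0, window - len(waveform)))
        return padded[np.newaxis, :]
    # Any trailing remainder shorter than one window is dropped.
    return waveform[: n_windows * window].reshape(n_windows, window)


def toy_embed(window):
    """Hypothetical frozen embedder; a real system would call a model here."""
    return np.array([window.mean(), window.std(), np.abs(window).max()])


def clip_embedding(waveform, sample_rate, ts_seconds, embed_fn):
    """Embed each window separately and mean-pool into one clip vector."""
    windows = split_into_windows(waveform, sample_rate, ts_seconds)
    return np.stack([embed_fn(w) for w in windows]).mean(axis=0)
```

Sweeping `ts_seconds` over a few values and training the same lightweight classifier on the resulting clip vectors is one way such a comparison could be set up. A shorter window also means a shorter input sequence per forward pass, which is where the memory savings would come from.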

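By studying the influence of audio input duration on existing pre-trained embedding models, this study finds that Audio Spectrogram Transformer-based systems remain effective with a shorter temporal support, which greatly reduces memory and computational cost. Choosing the optimal temporal support also yields competitive results across all tasks.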

On the choice of the optimal temporal support for audio classification with Pre-trained embeddings

In the past decade, convolutional neural networks (CNNs) have been widely
adopted as the main building block for end-to-end audio classification models,
which aim to learn a direct mapping from audio spectrograms to corresponding
labels. To better capture long-range global context, a recent trend is to add a
self-attention mechanism on top of the CNN, forming a CNN-attention hybrid
model. However, it is unclear whether the reliance on a CNN is necessary, and
whether neural networks based purely on attention are sufficient to obtain good
performance in audio classification. In this paper, we answer the question by
introducing the Audio Spectrogram Transformer (AST), the first
convolution-free, purely attention-based model for audio classification. We
evaluate AST on various audio classification benchmarks, where it achieves new
state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50,
and 98.1% accuracy on Speech Commands V2.
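
The core of a convolution-free model of this kind can be sketched in a few lines: the spectrogram is cut into patches, each patch is flattened into a token, and the tokens interact only through self-attention. The NumPy sketch below is illustrative rather than a reproduction of AST. The patch size of 16 and stride of 10 are assumptions not stated in the abstract, the weight matrices are placeholders supplied by the caller, and a full model would add a learned patch projection, positional embeddings, multiple heads and many layers.

```python
import numpy as np


def patchify(spectrogram, patch=16, stride=10):
    """Cut a (frequency, time) spectrogram into flattened square patches."""
    n_freq, n_time = spectrogram.shape
    patches = [
        spectrogram[f : f + patch, t : t + patch].ravel()
        for f in range(0, n_freq - patch + 1, stride)
        for t in range(0, n_time - patch + 1, stride)
    ]
    return np.stack(patches)  # shape: (n_patches, patch * patch)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because every patch token attends to every other token, long-range context across the whole spectrogram is available from the first layer, without stacking convolutions to grow a receptive field.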

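This paper introduces the Audio Spectrogram Transformer (AST), the first audio classification model that relies purely on self-attention rather than convolution, and reports new state-of-the-art results on several audio classification datasets.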