Automatic detection and severity level classification of dysarthria directly
from acoustic speech signals can be used as a tool in medical diagnosis. In
this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor
to build detection and severity level classification systems for dysarthric
speech. The experiments were carried out with the widely used UA-Speech
database. In the detection experiments, the best performance was obtained
using the embeddings from the first layer of the wav2vec model, which yielded
an absolute improvement of 1.23% in accuracy compared to the best-performing
baseline feature (spectrogram). In the severity level classification task,
the embeddings from the final layer gave an absolute improvement of 10.62% in
accuracy compared to the best-performing baseline feature (mel-frequency
cepstral coefficients).

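Using the pre-trained wav2vec 2.0 model as a feature extractor, this study
performs automatic detection and severity level classification of dysarthria
from acoustic speech signals. Embeddings from the first layer of the wav2vec
model improved detection accuracy by 1.23% (absolute) over the baseline
feature (spectrogram), and in the severity level classification task,
embeddings from the final layer improved accuracy by 10.62% (absolute) over
the baseline feature (mel-frequency cepstral coefficients).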
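
The gains above are reported as absolute improvements, that is, differences in
percentage points of accuracy. The following minimal Python sketch shows how an
absolute gain differs from a relative one. The function names and the baseline
accuracy are hypothetical and chosen only for illustration; the abstract does
not state the underlying accuracies.

```python
def absolute_improvement(baseline_pct, new_pct):
    """Difference in percentage points between two accuracies given in percent."""
    return new_pct - baseline_pct


def relative_improvement(baseline_pct, new_pct):
    """Improvement expressed as a percentage of the baseline value."""
    if baseline_pct == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Hypothetical accuracies, NOT taken from the paper: only the 10.62-point gap
# mirrors the reported absolute improvement.
hypothetical_baseline = 50.0
hypothetical_wav2vec = hypothetical_baseline + 10.62

print(round(absolute_improvement(hypothetical_baseline, hypothetical_wav2vec), 2))
print(round(relative_improvement(hypothetical_baseline, hypothetical_wav2vec), 2))
```

The same absolute gain corresponds to a larger relative gain when the baseline
is low and a smaller one when the baseline is high, so the two kinds of figures
should not be compared directly.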

Wav2vec-Based Speech Detection and Severity Level Classification: The Case of Dysarthria

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

Estimating dimensional emotions, such as activation, valence and dominance,
from acoustic speech signals has been widely explored over the past few years.
While accurate estimation of activation and dominance from speech seems to be
possible, the same for valence remains challenging. Previous research has shown
that the use of lexical information can improve valence estimation performance.
Lexical information can be obtained from pre-trained acoustic models, where the
learned representations can improve valence estimation from speech. We
investigate the use of pre-trained model representations to improve valence
estimation from acoustic speech signals. We also explore fusion of
representations to improve emotion estimation across all three emotion
dimensions: activation, valence and dominance. Additionally, we investigate if
representations from pre-trained models can be distilled into models trained
with low-level features, resulting in models with fewer parameters.
We show that fusion of pre-trained model embeddings results in a 79% relative
improvement in concordance correlation coefficient (CCC) on valence estimation
compared to a standard acoustic feature baseline (mel-filterbank energies),
while distillation from pre-trained model embeddings to lower-dimensional
representations yields a 12% relative improvement. Such performance gains were
observed over two evaluation sets, indicating that our proposed architecture
generalizes across those evaluation sets. We report new state-of-the-art
"text-free" acoustic-only dimensional emotion estimation CCC values on two
MSP-Podcast evaluation sets.

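This study explores using pre-trained acoustic models to bring lexical
information into estimation from acoustic speech signals in order to improve
emotion estimation, particularly valence estimation among the emotion
dimensions. It finds that fusing pre-trained model embeddings outperforms a
standard acoustic feature baseline (mel-filterbank energies), and that the
gains hold across the two MSP-Podcast evaluation sets.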
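
The results above are reported in terms of the concordance correlation
coefficient (CCC). As a reference point, here is a minimal Python sketch of the
standard definition (Lin's CCC), which rewards both correlation and agreement
in mean and scale between predictions and labels. The function name and the
use of population (1/n) moments are assumptions made for illustration; the
abstract does not specify how the authors compute the metric.

```python
def concordance_cc(predictions, labels):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    n = len(predictions)
    if n != len(labels) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_p = sum(predictions) / n
    mean_l = sum(labels) / n

    # Population (1/n) variances and covariance; this choice is an assumption.
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l) for p, l in zip(predictions, labels)) / n

    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


# Toy values for illustration only, not data from the paper.
labels = [1.0, 2.0, 3.0]
print(concordance_cc([1.0, 2.0, 3.0], labels))  # perfect agreement
print(concordance_cc([2.0, 3.0, 4.0], labels))  # perfectly correlated but shifted
```

Unlike Pearson correlation, the shifted predictions in the second call are
penalized through the squared mean difference in the denominator, which is why
CCC is a common choice for dimensional emotion estimation.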