Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture for both SSL, and ASR as feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.

自动语音质量评估中，由于数据稀缺，大多数研究仅在二元分类等简单任务上取得良好结果。本文提出了一种新的方法，通过采用预训练的Wav2Vec2架构作为语音评估中的特征提取器，将学习系统从片段级别提升至音频级别，从而建立了一个新的基准，使得只使用95个训练样本可以实现对可懂度和严重程度得分的预测，平均均方误差分别为0.73和1.15。结果表明，基于ASR的Wav2Vec2模型带来了最佳结果，并且可能暗示了ASR与语音质量评估之间的强相关性。同时，我们还评估了该方法在变长片段持续时间和语音内容等因素上的影响。

在数据稀缺环境中利用ASR驱动的Wav2Vec2探索病态语音质量评估