We investigate the robustness of pre-trained neural models for automatic speech recognition. Real-world data in machine learning is rarely clean: depending on the domain, it may contain outliers, random noise, or adversarial noise. Models should therefore be robust to such noise, a requirement that has given rise to the thriving field of robust machine learning. We consider this issue in the setting of automatic speech recognition. Given the increasing popularity of pre-trained models, it is important to analyze and understand how robust they are to noise. In this work, we perform a robustness analysis of the pre-trained neural models wav2vec2, HuBERT and DistilHuBERT on the LibriSpeech and TIMIT datasets. We apply several noising mechanisms and measure model performance in terms of inference time and the standard Word Error Rate (WER) metric. We also perform an in-depth layer-wise analysis of the wav2vec2 model by injecting noise between layers, which lets us predict at a high level what each layer learns. Finally, for this model, we visualize the propagation of errors across the layers and compare its behavior on clean versus noisy data. Our experiments confirm the predictions of Pasad et al. [2021] and also raise interesting directions for future work.
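
To make the evaluation metric concrete, the following is a minimal sketch of how the Word Error Rate can be computed and how a waveform might be corrupted before recognition. The WER function follows the standard definition (word-level edit distance divided by the number of reference words). The noising function is only an assumption for illustration: it adds white Gaussian noise at a chosen signal-to-noise ratio, and the names `word_error_rate`, `add_gaussian_noise` and `snr_db` are hypothetical and do not come from the experimental code.

```python
import math
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def add_gaussian_noise(signal, snr_db, seed=0):
    """Illustrative noising mechanism (an assumption, not the paper's code):
    add white Gaussian noise so that the result has roughly the given SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

In such a setup, one would transcribe both the clean and the noised waveform with the same model and compare the two WER values against the reference transcript; lower `snr_db` values correspond to stronger corruption.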

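This paper studies the robustness of pre-trained neural models for automatic speech recognition. It analyzes wav2vec2, HuBERT and DistilHuBERT and finds that their robustness to noise differs on the LibriSpeech and TIMIT datasets. It also performs a layer-wise analysis to predict what each layer learns and, by examining error propagation and comparing clean and noisy data, confirms the predictions of Pasad et al. and suggests interesting directions for future work.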

Robustness Analysis of End-to-End Neural Models for Automatic Speech Recognition