Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

本文提出了一种基于半监督学习的、使用预先训练的端到端自动语音识别（E2E ASR）和自监督语言模型（如BERT）进行微调的通用语义理解框架，该框架可从转录或未转录的语音中直接学习语义来解决一些SLU模型中的问题，如ASR错误、意图预测而不是词槽预测以及在大量训练数据不足的情况下训练。实验结果表明，该框架对于语义理解可以与使用Oracle文本作为输入的模型相媲美，具有良好的环境噪声鲁棒性，并且在训练集有限的情况下也能达到较好的效果。

自监督语音和语言模型预训练的半监督口语理解