End-to-end (E2E) spoken language understanding (SLU) systems, which generate a semantic parse directly from speech, have recently shown increasing promise. This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models, and it outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems remain weak when ASR transcription errors degrade the quality of the text representation. To overcome this issue, we propose a novel E2E SLU system that improves robustness to ASR errors by fusing audio and text representations according to the estimated modality confidence of the ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate these quality encodings into E2E SLU models. We show accuracy improvements on the STOP dataset and present an analysis that demonstrates the effectiveness of our approach.
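To make the idea of confidence-based fusion concrete, the following is a minimal sketch, not the method proposed in this work. It assumes the simplest possible scheme: a scalar confidence in [0, 1] for the ASR hypothesis linearly interpolates between the text and audio representations. The function name `fuse_by_confidence`, the scalar gate, and the toy vectors are all illustrative assumptions; the actual encoding of hypothesis quality and its integration into the model may differ.

```python
from typing import List, Sequence


def fuse_by_confidence(
    audio: Sequence[float],
    text: Sequence[float],
    confidence: float,
) -> List[float]:
    """Blend audio and text representations with a scalar gate.

    Hypothetical illustration: `confidence` stands in for an estimated
    quality score of the ASR hypothesis. A high value trusts the text
    representation; a low value falls back on the audio representation.
    """
    if len(audio) != len(text):
        raise ValueError("audio and text representations must have equal length")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Element-wise convex combination of the two modalities.
    return [confidence * t + (1.0 - confidence) * a for a, t in zip(audio, text)]


if __name__ == "__main__":
    audio_repr = [1.0, 0.0]  # toy audio embedding (assumed values)
    text_repr = [0.0, 1.0]   # toy text embedding (assumed values)
    # A low confidence keeps the fused vector close to the audio embedding.
    print(fuse_by_confidence(audio_repr, text_repr, 0.25))
```

Under this assumed scheme, an unreliable transcript (confidence near 0) leaves the fused representation dominated by audio, which is the intuition behind robustness to ASR errors.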

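This work proposes a novel end-to-end (E2E) spoken language understanding (SLU) system with improved robustness to ASR errors. It fuses audio and text representations based on the estimated modality confidence of ASR hypotheses, addressing the weakness of E2E SLU systems when text representation quality is low, and demonstrates the effectiveness of the approach through accuracy improvements and analysis on the STOP dataset.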

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding