It is challenging to extract semantic meanings directly from audio signals in spoken language understanding (SLU), due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics; these models, however, require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings and use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves a 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset and a 1.3% absolute improvement over the SOTA SLU model on the SLURP dataset.
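As an illustration of the joint objective described above, the following minimal sketch combines a CTC negative log-likelihood over frame-level token posteriors with an utterance-level cross-entropy loss. The function names, the interpolation weight `ctc_weight`, and the choice of a simple weighted sum are assumptions made for illustration, not details taken from the paper. A real system would compute both terms on encoder outputs with a deep learning framework rather than on plain Python lists.

```python
import math

def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    log_probs: T x V nested list of per-frame log-probabilities.
    target: list of token ids without blanks.
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    # Initialisation: a path may start with a blank or the first token.
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the final blank or the final token.
    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

def cross_entropy(logits, label):
    """Utterance-level classification loss (e.g. intent or dialogue act)."""
    return log_sum_exp(*logits) - logits[label]

def joint_loss(frame_log_probs, transcript, slu_logits, slu_label, ctc_weight=0.5):
    """Hypothetical weighted sum of the CTC and SLU losses (weighting is an assumption)."""
    ctc = ctc_neg_log_likelihood(frame_log_probs, transcript)
    slu = cross_entropy(slu_logits, slu_label)
    return ctc_weight * ctc + (1.0 - ctc_weight) * slu

# Toy usage: 2 frames, vocabulary {blank, 1, 2}, uniform posteriors.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
loss = joint_loss(uniform, [1], [0.0, 0.0], 0, ctc_weight=0.5)
```

Because both terms are computed from non-autoregressive outputs, a sketch like this needs no decoding loop at training time, which is the efficiency argument the abstract makes against sequence-to-sequence ASR front ends.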

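By leveraging self-supervised acoustic encoders to extract textual embeddings and training with joint CTC and SLU losses, this work builds an utterance-level SLU model for spoken language understanding tasks. The model improves on the SOTA dialogue act classification model by 4% absolute on the DSTC2 dataset and on the SOTA SLU model by 1.3% absolute on the SLURP dataset.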

End-to-End Spoken Language Understanding with Joint CTC Loss and Self-Supervised Pretrained Acoustic Encoders