We propose a framework for learning semantics from raw audio signals using two types of representations, one encoding contextual information and the other phonetic information. Specifically, we introduce a speech-to-unit processing pipeline that captures the two types of representations at different time resolutions. For the language model, we adopt a dual-channel architecture that incorporates both types of representation. We also present two new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Commands dataset show that our framework learns semantics better than models trained with only one type of representation.

Learning Semantic Information from Raw Audio Signals Using Contextual and Phonetic Representations