The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only the parts of the speech signal relevant to transcription and discards irrelevant information. In this paper, we construct a representation learning task based on joint modeling of automatic speech recognition (ASR) and text-to-speech (TTS), and seek to learn a representation of audio that separates the part of the speech signal that is relevant to transcription from the part that is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then observe that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training yields a 24.5% relative improvement in word error rate (WER) on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.

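This work constructs an acoustic representation learning task based on joint modeling, with an emphasis on disentangling the relevant and irrelevant parts of the speech signal. It then shows that the desired, disentangled solutions possess unique statistical properties, and that enforcing these properties during training improves WER by 24.5% relative on average, which suggests a new approach to learning effective audio representations.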

Towards Disentangled Speech Representations