Disentangled representation learning from speech remains limited despite its
importance in many application domains. A key challenge is the lack of speech
datasets with known generative factors to evaluate methods. This paper proposes
SynSpeech: a novel synthetic speech dataset with ground truth factors enabling
research on disentangling speech representations. We plan to present a
comprehensive study evaluating supervised techniques using established
supervised disentanglement metrics. This benchmark dataset and framework
address the gap in the rigorous evaluation of state-of-the-art disentangled
speech representation learning methods. Our findings will provide insights to
advance this underexplored area and enable more robust speech representations.

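This work uses the synthetic dataset SynSpeech in a comprehensive study evaluating supervised techniques for disentangling speech representations. It addresses the lack of speech datasets with known generative factors, provides a rigorous evaluation and framework for state-of-the-art speech representation learning methods, and aims to advance this relatively underexplored area.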

Learning Disentangled Speech Representations

MOS (Mean Opinion Score) is a subjective method for evaluating a system's
quality. Telecommunications (for voice and video) and speech synthesis systems
(for generated speech) are among its many applications. While MOS tests are
widely accepted, they are time-consuming and costly because they require human
input. In addition, because the systems and subjects differ between tests, the
results are not directly comparable. On the other hand, the large number of
previous tests allows us to train machine learning models capable of
predicting MOS values. Predicting MOS values automatically can resolve both of
these issues.
The present work introduces data-, training-, and post-training-specific
improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and
non-linear dense layers. We introduced transfer learning, target data
preprocessing, two- and three-phase training methods with different batch
formulations, dropout accumulation (for larger batch sizes), and quantization of
the predictions.
The methods are evaluated using the shared synthetic speech dataset of the
first Voice MOS challenge.
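
The abstract does not say how the predictions are quantized. One plausible
reading, sketched here purely as an assumption, is that a MOS label is the mean
of a fixed number of integer ratings, so valid labels lie on a grid with step
1 / n_raters, and a continuous prediction can be snapped to the nearest grid
point. The function name, the default of 8 raters, and the 1 to 5 rating range
are all hypothetical choices for illustration, not details from the paper.

```python
def quantize_mos(pred, n_raters=8, lo=1.0, hi=5.0):
    """Snap a predicted MOS to the nearest value attainable as the mean
    of `n_raters` integer ratings in [lo, hi].

    Assumption (not from the paper): labels are means of a fixed number
    of integer ratings, so they lie on a grid with step 1 / n_raters.
    """
    step = 1.0 / n_raters
    # Clip first so out-of-range predictions map to the scale limits.
    clipped = min(max(pred, lo), hi)
    # Round to the nearest grid point.
    return round(clipped / step) * step


def quantize_batch(preds, n_raters=8):
    """Apply quantize_mos to every prediction in a list."""
    return [quantize_mos(p, n_raters=n_raters) for p in preds]


if __name__ == "__main__":
    raw = [3.47, 2.06, 5.3, 0.2]
    print(quantize_batch(raw))
```

Whether such snapping helps depends on the evaluation metric: it can reduce
error when the true labels really do lie on the assumed grid, but it cannot
change the rank order of predictions beyond introducing ties.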

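This study introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model. It combines a wav2vec 2.0 model with transfer learning, different batch formulations, and quantization of the predictions, and evaluates these techniques for automatic MOS prediction.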