Aki Kunikoshi, Jaebok Kim, Wonsuk Jun, Kåre Sjölander
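TL;DR: This study compares the performance of self-supervised learning features and spectral features, and combines the two to improve the accuracy of automatic MOS prediction. Using a large-scale listening test corpus, it finds that wav2vec features generalize best and that the combined features perform best overall.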
Abstract
Automatic methods for predicting the mean opinion score (MOS) of listeners have been
researched to assure the quality of text-to-speech systems. Many previous
studies focus on architectural advances (e.g. MBNet, LDNet, e