This paper presents ALO-VC, a non-parallel, low-latency, one-shot voice conversion
method based on phonetic posteriorgrams (PPGs). ALO-VC enables any-to-any
voice conversion using only one utterance from the target speaker, with only
47.5 ms of future look-ahead. The proposed hybrid signal processing and machine
learning pipeline combines a pre-trained speaker encoder, a pitch predictor to
predict the converted speech's prosody, and positional encoding to convey the
phoneme's location information. We introduce two system versions: ALO-VC-R,
which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which improves
performance using the ECAPA-TDNN speaker encoder. The experimental results
demonstrate that both ALO-VC-R and ALO-VC-E achieve performance comparable to
non-causal baseline systems on the VCTK dataset and two out-of-domain datasets.
Furthermore, both proposed systems can be deployed on a single CPU core with 55
ms latency and a real-time factor of 0.78. Our demo is available online.
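
As a rough illustration of the real-time factor mentioned above: RTF is conventionally defined as processing time divided by the duration of the audio processed, so a value below 1 means the system runs faster than real time. The helper name `real_time_factor` and the example timings are assumptions made for illustration and are not taken from the ALO-VC implementation.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 7.8 s of compute spent on 10 s of audio.
rtf = real_time_factor(7.8, 10.0)
print(f"RTF = {rtf:.2f} (faster than real time: {rtf < 1.0})")
```

In practice the processing time would be measured around the conversion call on the target hardware, here a single CPU core.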

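This paper presents ALO-VC, a non-parallel, low-latency, one-shot voice conversion method based on phonetic posteriorgrams. It uses a hybrid signal processing and machine learning pipeline that combines a pre-trained speaker encoder, a pitch predictor, and positional encoding. Two system versions are provided; both can be deployed on a single CPU core and achieve performance comparable to non-causal baseline systems.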

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Dysarthric speech reconstruction (DSR), which aims to improve the quality of
dysarthric speech, remains a challenge because the reconstructed speech must not
only sound normal but also preserve the speaker's identity. Speaker
representations extracted by a speaker encoder (SE) optimized for speaker
verification have been explored to control speaker identity. However, the SE
may not fully capture the characteristics of previously unseen dysarthric
speakers. To address this problem, we propose a
novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
The primary task of ASA fine-tunes the SE with the speech of the target
dysarthric speaker to effectively capture identity-related information, and the
secondary task applies adversarial training to avoid the incorporation of
abnormal speaking patterns into the reconstructed speech by regularizing the
distribution of reconstructed speech to be close to that of high-quality
reference speech. Experiments show that the proposed approach achieves
enhanced speaker similarity and speech naturalness comparable to a strong
baseline approach. Compared with dysarthric speech, the reconstructed speech
achieves absolute word error rate reductions of 22.3% and 31.5% for speakers
with moderate and moderate-severe dysarthria, respectively. Our demo page is
available online.
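
To make the reported metric concrete, the sketch below computes word error rate (WER) as the word-level edit distance divided by the number of reference words, and an absolute reduction as the plain difference between two WER values (that is, percentage points rather than a relative change). The function names and the example sentences are hypothetical and are not drawn from the paper's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Difference between two WER values, not a relative change."""
    return wer_before - wer_after


# Hypothetical recognizer outputs for one reference sentence.
reference = "turn on the kitchen light"
wer_dysarthric = word_error_rate(reference, "turn the chicken right")
wer_reconstructed = word_error_rate(reference, "turn on the kitchen right")
reduction = absolute_wer_reduction(wer_dysarthric, wer_reconstructed)
print(f"absolute WER reduction: {100 * reduction:.1f} percentage points")
```

Reading the reported figures this way, a 22.3% absolute reduction means the WER dropped by 22.3 percentage points, which is a different quantity from a 22.3% relative improvement.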

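We propose a multi-task learning strategy based on adversarial speaker adaptation. The primary task fine-tunes the speaker encoder to effectively capture identity-related information, while adversarial training regularizes the distribution of the reconstructed speech to avoid introducing abnormal speaking patterns. Results show that the method achieves enhanced speaker similarity while maintaining speech naturalness.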