We adapt the architectures of previous audio manipulation and generation
neural networks to the task of real-time any-to-one voice conversion. Our
resulting model, LLVC ($\textbf{L}$ow-latency $\textbf{L}$ow-resource
$\textbf{V}$oice $\textbf{C}$onversion), has a latency of under 20ms at a
sample rate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU.
LLVC combines a generative adversarial architecture with knowledge
distillation to attain this performance. To our knowledge, LLVC
achieves both the lowest resource usage and the lowest latency of any
open-source voice conversion model. We provide open-source samples, code, and
pretrained model weights at this https URL
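
To make the headline figures concrete, the arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, and treating the full 20 ms latency budget as a single audio chunk is an assumption, not something the abstract states.

```python
def samples_in_window(window_ms: float, samples_per_second: int) -> int:
    """Number of audio samples covered by a window of the given duration."""
    return round(window_ms * samples_per_second / 1000)


def compute_time_ms(audio_ms: float, speed_factor: float) -> float:
    """Time needed to process audio_ms of audio at speed_factor times real time."""
    return audio_ms / speed_factor


# Figures from the abstract: 20 ms latency, 16 kHz audio, about 2.8x real time.
chunk_samples = samples_in_window(20, 16_000)

# Assumption: one 20 ms chunk is processed at a time on the CPU.
chunk_compute_ms = compute_time_ms(20, 2.8)

print(chunk_samples, round(chunk_compute_ms, 2))
```

Under these assumptions, a 20 ms window at 16 kHz holds 320 samples, and a model running 2.8x faster than real time would spend a little over 7 ms of compute on it.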

Low-latency Real-time Voice Conversion on CPU

Thanks to recent advancements in end-to-end speech modeling technology, it
has become increasingly feasible to imitate and clone a user's voice. This
leads to a significant challenge in differentiating between authentic and
fabricated audio segments. To address the issue of user voice abuse and misuse,
the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and
analyze deepfake speech utterances. Specifically, Track 2, named the
Manipulation Region Location (RL), aims to pinpoint the location of manipulated
regions in audio, which can be present in both real and generated audio
segments. We propose our novel TranssionADD system as a solution to the
challenging problem of model robustness and audio segment outliers in the trace
competition. Our system provides three unique contributions: 1) we adapt
sequence tagging task for audio deepfake detection; 2) we improve model
generalization by various data augmentation techniques; 3) we incorporate
multi-frame detection (MFD) module to overcome limited representation provided
by a single frame and use isolated-frame penalty (IFP) loss to handle outliers
in segments. Our best submission achieved 2nd place in Track 2, demonstrating
the effectiveness and robustness of our proposed system.
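
As a rough illustration of the sequence-tagging view, the sketch below turns per-frame tags into time regions and scores how "isolated" individual frames are. Everything here is an assumption made for illustration: the 20 ms frame length, the function names, and in particular the penalty formula, which is a hypothetical stand-in and not the IFP loss defined by the TranssionADD authors.

```python
from typing import List, Tuple


def frames_to_regions(labels: List[int], frame_ms: float) -> List[Tuple[float, float]]:
    """Merge runs of frames tagged 1 (manipulated) into (start_ms, end_ms) regions."""
    regions = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a manipulated run begins here
        elif label != 1 and start is not None:
            regions.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # run extends to the end of the utterance
        regions.append((start * frame_ms, len(labels) * frame_ms))
    return regions


def isolated_frame_penalty(probs: List[float]) -> float:
    """Hypothetical penalty (not the paper's formula): mean distance of each
    interior frame's score from the range spanned by its two neighbours."""
    if len(probs) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(probs, probs[1:], probs[2:]):
        lo, hi = min(prev, nxt), max(prev, nxt)
        total += max(lo - cur, cur - hi, 0.0)
    return total / (len(probs) - 2)


# Assumed 20 ms frames; tags are 0 (genuine) or 1 (manipulated).
print(frames_to_regions([0, 1, 1, 0, 1], frame_ms=20.0))
print(isolated_frame_penalty([0.0, 1.0, 0.0]))  # lone spike
print(isolated_frame_penalty([0.0, 1.0, 1.0]))  # clean boundary
```

In this sketch a lone spike between two agreeing neighbours is penalised, while a clean boundary between a genuine and a manipulated run is not, which is one plausible way to discourage single-frame outliers within a segment.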

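This work proposes the TranssionADD system, which improves model capability by combining a sequence tagging task and an MFD module with various data augmentation techniques, and uses an IFP loss to handle segment outliers, effectively addressing the problem of detecting deepfake speech utterances.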