Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody
fits the preceding conversation. However, modeling the conversation
comprehensively remains a challenge: most conversational TTS systems extract
only global information and omit local prosody features, which carry important
fine-grained information such as keywords and emphasis. Moreover, textual
features alone are insufficient, since acoustic features also carry rich
prosodic information. Hence, we propose M2-CTTS, an end-to-end multi-scale
multi-modal conversational text-to-speech system that aims to make
comprehensive use of the historical conversation and to enhance prosodic
expression. More specifically, we design a textual context module and an
acoustic context module, each with both coarse-grained and fine-grained
modeling. Experimental results demonstrate that our model, which incorporates
fine-grained context information and additionally considers acoustic features,
achieves better prosody performance and naturalness in CMOS tests.

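This work proposes M2-CTTS, a multi-scale, multi-modal conversational text-to-speech system that makes comprehensive use of the historical conversation and enhances prosodic expression. By modeling textual and acoustic factors at both coarse and fine granularity, and by incorporating fine-grained context information and acoustic features, it achieves better prosody and naturalness.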
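
The distinction between coarse-grained and fine-grained context can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the M2-CTTS implementation: the function names (`mean_pool`, `attend`, `multi_scale_context`) are hypothetical, the coarse context is approximated by mean pooling over history utterances, and the fine context by dot-product attention from each current token over all history tokens. The input vectors stand in for either textual or acoustic features.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention of one query over keys (keys double as values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

def multi_scale_context(history, current):
    """history: list of utterances, each a list of token vectors.
    current: token vectors of the reply to be synthesized.
    Returns one conditioning vector per current token: token + coarse + fine."""
    # Coarse-grained (assumed): one vector per utterance, pooled across the history.
    coarse = mean_pool([mean_pool(utterance) for utterance in history])
    # Fine-grained (assumed): each current token attends over every history token.
    history_tokens = [tok for utterance in history for tok in utterance]
    return [tok + coarse + attend(tok, history_tokens) for tok in current]

# Toy 2-dimensional features for two history utterances and a one-token reply.
history = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
current = [[1.0, 0.0]]
context = multi_scale_context(history, current)
```

In this sketch the coarse vector is identical for every token of the reply, while the fine vector changes per token, which is the property that lets local cues such as emphasis influence individual words.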

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Non-parallel multi-domain voice conversion (VC) is a technique for learning
mappings among multiple domains without relying on parallel data. This is
important but challenging owing to the requirement of learning multiple
mappings and the non-availability of explicit supervision. Recently, StarGAN-VC
has garnered attention owing to its ability to solve this problem using only a
single generator. However, there is still a gap between real and converted
speech. To bridge this gap, we rethink conditional methods of StarGAN-VC, which
are key components for achieving non-parallel multi-domain VC in a single
model, and propose an improved variant called StarGAN-VC2. In particular, we
rethink conditional methods in two aspects: training objectives and network
architectures. For the former, we propose a source-and-target conditional
adversarial loss that allows all source domain data to be convertible to the
target domain data. For the latter, we introduce a modulation-based conditional
method that can transform the modulation of the acoustic feature in a
domain-specific manner. We evaluated our methods on non-parallel multi-speaker
VC. An objective evaluation demonstrates that our proposed methods improve
speech quality in terms of both global and local structure measures.
Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms
StarGAN-VC in terms of naturalness and speaker similarity. The converted speech
samples are provided at
this http URL

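This work proposes StarGAN-VC2, an improved conditional method that combines a source-and-target conditional adversarial loss with a modulation-based conditional method to make multi-domain voice conversion more accurate and natural. Experimental results show that it outperforms the earlier StarGAN-VC model in speech quality and speaker similarity.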
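
To make the idea of domain-specific modulation concrete, here is a minimal sketch of one common way to realize it: normalize each feature channel over time, then apply a scale and shift chosen by the domain. This is an assumption for illustration, not the StarGAN-VC2 code. The name `modulate`, the `DOMAIN_PARAMS` table, and its values are hypothetical; in a trained model the scale and shift would be learned rather than fixed.

```python
import math

# Hypothetical per-domain modulation parameters: one scale (gamma) and one
# shift (beta) per feature channel. The values are made up for illustration.
DOMAIN_PARAMS = {
    "speaker_a": {"gamma": [1.0, 1.0], "beta": [0.0, 0.0]},
    "speaker_b": {"gamma": [2.0, 0.5], "beta": [1.0, -1.0]},
}

def modulate(features, domain, eps=1e-5):
    """features: list of channels, each a list of per-frame values.
    Normalizes every channel over time, then applies the domain's scale and shift."""
    params = DOMAIN_PARAMS[domain]
    out = []
    for channel, gamma, beta in zip(features, params["gamma"], params["beta"]):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([gamma * (x - mean) / std + beta for x in channel])
    return out

# Toy acoustic feature with 2 channels and 4 frames.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]
converted = modulate(features, "speaker_b")
```

After modulation, each channel's mean and standard deviation over time match the shift and scale selected for the domain, so a single network can impose different feature statistics for each speaker.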