This paper investigates a method for simulating natural conversation for
training end-to-end neural diarization (EEND) models. Due to the lack of
annotated real conversational datasets, EEND is usually pretrained on a
large-scale simulated conversational dataset first and then adapted to the
target real dataset. Simulated datasets play an essential role in the training
of EEND, but optimal simulation methods have not yet been sufficiently
investigated. We thus propose a method to simulate natural conversational
speech. In contrast to conventional methods, which simply combine the speech of
multiple speakers, our method takes turn-taking into account. We define four
types of speaker transition and sequentially arrange them to simulate natural
conversations. The dataset simulated using our method was found to be
statistically similar to the real dataset in terms of the silence and overlap
ratios. The experimental results on two-speaker diarization using the CALLHOME
and CSJ datasets showed that the simulated dataset contributes to improving the
performance of EEND.
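
As a rough illustration of the idea described above, the following sketch arranges speaker transitions one after another to build a two-speaker activity timeline and then measures its silence and overlap ratios. The abstract does not name the four transition types, so the labels used here (`hold`, `switch`, `interrupt`, `backchannel`), the frame-length ranges, the uniform sampling, and the definition of both ratios as fractions of all frames are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical transition labels: the abstract only states that four types exist.
TRANSITIONS = ("hold", "switch", "interrupt", "backchannel")


def simulate_timeline(n_transitions, seed=0):
    """Return spk, where spk[s][t] == 1 if speaker s is active at frame t."""
    rng = random.Random(seed)
    spk = [[], []]

    def append(active, n):
        # Extend both speakers' tracks by n frames; only speakers in `active` talk.
        for s in (0, 1):
            spk[s].extend([1 if s in active else 0] * n)

    cur = 0
    append({cur}, rng.randint(10, 40))  # opening utterance
    for _ in range(n_transitions):
        kind = rng.choice(TRANSITIONS)
        dur = rng.randint(10, 40)  # utterance length in frames (assumed range)
        gap = rng.randint(1, 8)    # pause or overlap length in frames (assumed range)
        if kind == "hold":
            # Same speaker continues after a pause.
            append(set(), gap)
            append({cur}, dur)
        elif kind == "switch":
            # The other speaker takes the turn after a pause.
            cur = 1 - cur
            append(set(), gap)
            append({cur}, dur)
        elif kind == "interrupt":
            # The other speaker starts before the previous utterance ends.
            cur = 1 - cur
            for t in range(len(spk[cur]) - gap, len(spk[cur])):
                spk[cur][t] = 1
            append({cur}, dur)
        else:
            # Backchannel: a short response inside the current speaker's utterance.
            start = len(spk[0]) + rng.randint(0, dur - gap)
            append({cur}, dur)
            for t in range(start, start + gap):
                spk[1 - cur][t] = 1
    return spk


def silence_overlap_ratios(spk):
    """Fractions of frames with no active speaker and with both speakers active."""
    n = len(spk[0])
    silence = sum(1 for a, b in zip(*spk) if a + b == 0)
    overlap = sum(1 for a, b in zip(*spk) if a + b == 2)
    return silence / n, overlap / n


if __name__ == "__main__":
    print(silence_overlap_ratios(simulate_timeline(200)))
```

Comparing these two ratios between a simulated timeline and real annotations is one simple way to check how closely the simulation matches real conversations.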

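This work studies a method for simulating natural conversation for training end-to-end neural diarization (EEND) and proposes a simulation method that takes turn-taking into account, in order to improve EEND training when real conversational data are lacking. Experiments on the CALLHOME and CSJ datasets show that data simulated with this method help improve the performance of EEND.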


Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization

Speaker diarization systems are challenged by a trade-off between the
temporal resolution and the fidelity of the speaker representation. A
multi-scale approach copes with this trade-off by achieving high temporal
resolution together with improved accuracy. In this paper, we
propose a more advanced multi-scale diarization system based on a multi-scale
diarization decoder. There are two main contributions in this study that
significantly improve the diarization performance. First, we use multi-scale
clustering as an initialization to estimate the number of speakers and obtain
the average speaker representation vector for each speaker and each scale.
Next, we propose the use of 1-D convolutional neural networks that dynamically
determine the importance of each scale at each time step. To handle a variable
number of speakers and overlapping speech, the proposed system can estimate the
number of existing speakers. Our proposed system achieves state-of-the-art
performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05%
diarization error rates, respectively.
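
To make the scale-weighting idea above more concrete, the sketch below combines per-scale cosine similarities between one time step's embeddings and each speaker's average embeddings, using one weight per scale. It is only an illustration under stated assumptions: in the described system the weights come from 1-D convolutional neural networks, whereas here they are supplied directly as hypothetical `scale_logits` and normalized with a softmax. The function names and the use of cosine similarity are likewise assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_speaker_scores(step_embs, spk_avgs, scale_logits):
    """Combine per-scale similarities into one score per speaker.

    step_embs[k]    : embedding of the current time step at scale k
    spk_avgs[s][k]  : average embedding of speaker s at scale k
    scale_logits[k] : unnormalized importance of scale k at this step
                      (stands in for the output of the 1-D CNN)
    """
    weights = softmax(scale_logits)
    return [
        sum(w * cosine(step_embs[k], avg[k]) for k, w in enumerate(weights))
        for avg in spk_avgs
    ]


if __name__ == "__main__":
    step = [[1.0, 0.0], [0.0, 1.0]]            # two scales, 2-D embeddings
    speakers = [
        [[1.0, 0.0], [0.0, 1.0]],              # speaker 0 matches at both scales
        [[0.0, 1.0], [1.0, 0.0]],              # speaker 1 matches at neither
    ]
    print(weighted_speaker_scores(step, speakers, [0.0, 0.0]))
```

Because the weights are recomputed at every time step, a step where the short scale is unreliable can lean on the longer scales, and the reverse.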

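This study proposes an advanced multi-scale speaker diarization system based on a multi-scale diarization decoder. Multi-scale clustering is used as an initialization to estimate the number of speakers and the average speaker representation vector at each scale, and 1-D convolutional neural networks dynamically determine the importance of each scale at each time step, which mitigates the trade-off between temporal resolution and speaker-representation fidelity. The system can estimate the number of speakers present and achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with diarization error rates of 3.92% and 1.05%, respectively.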