We propose a data cleansing method that uses the neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization (EEND) model for singer diarization. Choral singing is common in popular music, but song data that contains it is unsuitable for generating a simulated training dataset. Our proposed method therefore converts such song data into solo singing data. The cleansing is based on NANSY++, a framework trained to reconstruct a non-overlapped input audio signal. We exploit a pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio. This cleansing process mitigates the mislabeling of choral singing as solo singing and helps train EEND models effectively even when the majority of the available song data contains choral singing sections. We experimentally evaluated an EEND model trained on a dataset cleansed with our proposed method, using annotated popular duet songs as the evaluation data. As a result, our proposed method improved the diarization error rate by 14.8 points.

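We propose a data cleansing method that uses the neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization (EEND) model for singer diarization. Our method uses a pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio, which mitigates the mislabeling of choral singing as solo singing and helps train EEND models effectively even when most of the available song data contains choral sections. Experiments on annotated popular duet songs show that our method improves the diarization error rate by 14.8 points.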

Song Data Cleansing for End-to-End Neural Singer Diarization Using a Neural Analysis and Synthesis Framework