In this paper, we propose a pipeline that determines the number of speakers
in a source of audio data, and the audio belonging to each identified speaker,
when neither the number of speakers nor the speaker labels are known a priori.
We used this approach as a part of our Data Preparation
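This paper studies the task of automatically identifying spoken languages using web audio data. Semi-random search phrases are generated from language-specific Wikipedia data and used to retrieve videos from YouTube; voice activity detection and speaker diarization are then applied to extract the video segments that contain speech. The resulting language identification models can be used for a variety of spoken language recognition tasks, and the automatically retrieved data gives better results than hand-labeled proprietary datasets.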
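
The search-phrase step can be illustrated with a minimal sketch. The text does not specify how the semi-random phrases are built, so the approach below (sampling short contiguous word windows from a passage) is only an assumption, and the function name `semi_random_phrases`, its parameters, and the sample sentence are hypothetical.

```python
import random
import re


def semi_random_phrases(text, n_phrases=5, min_words=2, max_words=3, seed=0):
    """Sample short contiguous word windows from `text` to use as search queries.

    Hypothetical helper: the source does not describe its sampling scheme.
    """
    # Assumed tokenization: runs of Unicode word characters, lowercased.
    words = re.findall(r"\w+", text.lower())
    if len(words) < min_words:
        return []
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    phrases = []
    for _ in range(n_phrases):
        # Pick a window length, then a start index that keeps the window in range.
        length = rng.randint(min_words, min(max_words, len(words)))
        start = rng.randrange(len(words) - length + 1)
        phrases.append(" ".join(words[start:start + length]))
    return phrases


# Stand-in for a sentence taken from a language-specific Wikipedia dump.
sample = "Tallinn is the capital and most populous city of Estonia"
queries = semi_random_phrases(sample, n_phrases=3)
print(queries)
```

Each returned phrase would then be submitted as a video search query; retrieval, voice activity detection, and speaker diarization are outside the scope of this sketch.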