Spoken language recognition (SLR) is the task of automatically identifying
the language present in a speech signal. Existing SLR models are either too
computationally expensive or too large to run effectively on devices with
limited resources. For real-world deployment, a model should also gracefully
handle unseen languages outside of the target language set, yet prior work has
focused on closed-set classification, where all input languages are known
a priori. In this paper, we address these two limitations: we explore efficient
model architectures for SLR based on convolutional networks, and propose a
multilabel training strategy to handle non-target languages at inference time.
Using the VoxLingua107 dataset, we show that our models obtain competitive
results while being orders of magnitude smaller and faster than current
state-of-the-art methods, and that our multilabel strategy is more robust to
unseen non-target languages than multiclass classification.
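
The difference between the multiclass and multilabel decision rules can be
illustrated with a minimal sketch. It is not the authors' implementation: the
language list, the logits, and the 0.5 threshold are assumptions chosen for
illustration. A softmax forces the scores to sum to one, so some target
language always wins; independent sigmoids allow every score to be low, which
lets the system reject the input as non-target.

```python
import math

# Hypothetical target language set, for illustration only.
LANGUAGES = ["en", "de", "fr"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def multiclass_decision(logits, languages):
    """Closed-set rule: always return the language with the highest softmax score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return languages[best]


def multilabel_decision(logits, languages, threshold=0.5):
    """Multilabel rule: score each language independently and return None
    (non-target) when no score reaches the threshold (an assumed value)."""
    scores = [sigmoid(x) for x in logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None
    return languages[best]


# Made-up logits: one clear target-language input, one input from an unseen language.
target_logits = [3.0, -2.0, -4.0]
unseen_logits = [-2.0, -3.0, -2.5]

print(multiclass_decision(unseen_logits, LANGUAGES))  # forced to pick a target language
print(multilabel_decision(unseen_logits, LANGUAGES))  # rejected as non-target
```

In this toy case both rules agree on the target-language input, and only the
multilabel rule can abstain on the unseen one.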

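This paper describes how to build efficient spoken language recognition models with convolutional neural networks, using a multilabel training strategy to handle non-target languages. Experiments show that the models are orders of magnitude smaller and faster than current state-of-the-art methods, and that the multilabel strategy is more robust to unseen non-target languages than multiclass classification.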

Efficient Spoken Language Recognition via Multilabel Classification

Spoken language recognition (SLR) refers to the automatic process used to
determine the language present in a speech sample. SLR is an important task in
its own right, for example, as a tool to analyze or categorize large amounts of
multilingual data. It is also an essential tool for selecting downstream
applications in a workflow, for example, to choose appropriate
speech recognition or machine translation models. SLR systems are usually
composed of two stages, one where an embedding representing the audio sample is
extracted and a second one which computes the final scores for each language.
In this work, we approach the SLR task as a detection problem and implement the
second stage as a probabilistic linear discriminant analysis (PLDA) model. We
show that discriminative training of the PLDA parameters gives large gains with
respect to the usual generative training. Further, we propose a novel
hierarchical approach where two PLDA models are trained, one to generate scores
for clusters of highly related languages and a second one to generate scores
conditional on each cluster. The final language detection scores are computed
as a combination of these two sets of scores. The complete model is trained
discriminatively to optimize a cross-entropy objective. We show that this
hierarchical approach consistently outperforms the non-hierarchical one for
detection of highly related languages, in many cases by large margins. We train
our systems on a collection of datasets including over 100 languages, and test
them on both matched and mismatched conditions, showing that the gains are
robust to condition mismatch.
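
One way to picture the combination of the two sets of scores is sketched
below. The abstract does not specify the combination rule, so this sketch
assumes the simplest option: the language posterior is the product of the
cluster posterior and the within-cluster posterior, computed as a sum in the
log domain. The cluster names, language codes, and raw scores are made up, and
plain score dictionaries stand in for the outputs of the two PLDA models.

```python
import math


def log_softmax(scores):
    """Convert raw scores to log-posteriors with a stable log-sum-exp."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]


def combine_hierarchical(cluster_scores, within_scores):
    """Assumed rule: log P(lang) = log P(cluster) + log P(lang | cluster).

    cluster_scores: dict mapping cluster name -> raw score.
    within_scores:  dict mapping cluster name -> {language -> raw score}.
    Returns a dict mapping language -> log-posterior.
    """
    clusters = list(cluster_scores)
    cluster_logp = dict(
        zip(clusters, log_softmax([cluster_scores[c] for c in clusters]))
    )
    combined = {}
    for c in clusters:
        langs = list(within_scores[c])
        lang_logp = log_softmax([within_scores[c][lang] for lang in langs])
        for lang, lp in zip(langs, lang_logp):
            combined[lang] = cluster_logp[c] + lp
    return combined


# Hypothetical scores for one audio sample.
cluster_scores = {"slavic": 2.0, "romance": 0.0}
within_scores = {
    "slavic": {"ru": 1.0, "uk": 0.5},
    "romance": {"es": 0.0, "pt": 0.0},
}

log_posteriors = combine_hierarchical(cluster_scores, within_scores)
best_language = max(log_posteriors, key=log_posteriors.get)
print(best_language)
```

Under this assumption the combined posteriors still sum to one across all
languages, so they can be fed directly into a cross-entropy objective.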

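This paper presents a spoken language recognition method based on a probabilistic linear discriminant analysis (PLDA) model, which scores each language from an embedding extracted from the audio sample. The model uses a hierarchical approach and is trained discriminatively with a cross-entropy objective. Results show that the method improves detection of highly related languages and is robust to condition mismatch.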

A Discriminative Hierarchical PLDA-based Model for Spoken Language Recognition

This paper investigates the use of automatically collected web audio data for
the task of spoken language recognition. We generate semi-random search phrases
from language-specific Wikipedia data that are then used to retrieve videos
from YouTube for 107 languages. Speech activity detection and speaker
diarization are used to extract segments from the videos that contain speech.
Post-filtering is used to remove segments from the database that are likely not
in the given language, increasing the proportion of correctly labeled segments
to 98%, based on crowd-sourced verification. The size of the resulting training
set (VoxLingua107) is 6628 hours (62 hours per language on average), and it
is accompanied by an evaluation set of 1609 verified utterances. We use the
data to build language recognition models for several spoken language
identification tasks. Experiments show that using the automatically retrieved
training data gives results competitive with those obtained using hand-labeled
proprietary datasets. The dataset is publicly available.
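
The abstract does not describe how the semi-random search phrases are built.
The sketch below shows one plausible approach, under the assumption that a
phrase is a short run of consecutive words drawn from a randomly chosen
sentence of language-specific text. The function name, the three-word phrase
length, and the toy sentences are all hypothetical.

```python
import random

# Toy stand-in for sentences extracted from language-specific Wikipedia text.
TOY_SENTENCES = [
    "the old town hall was rebuilt after the fire",
    "rivers in the north freeze during the winter months",
    "the national library holds many rare manuscripts",
]


def sample_search_phrase(sentences, rng, num_words=3):
    """Pick a random sentence and return num_words consecutive words from it.

    This is an assumed procedure, not the one used to build the dataset.
    """
    candidates = [s.split() for s in sentences if len(s.split()) >= num_words]
    if not candidates:
        raise ValueError("no sentence has enough words for a phrase")
    words = rng.choice(candidates)
    start = rng.randrange(len(words) - num_words + 1)
    return " ".join(words[start:start + num_words])


rng = random.Random(0)  # fixed seed so the sketch is repeatable
phrase = sample_search_phrase(TOY_SENTENCES, rng)
print(phrase)
```

Each sampled phrase would then serve as a query for retrieving candidate
videos in the corresponding language.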

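This paper studies the use of automatically collected web audio data for spoken language recognition. Semi-random search phrases generated from language-specific Wikipedia data are used to retrieve videos from YouTube, and speech activity detection and speaker diarization are used to extract the segments that contain speech. The resulting data are used to build language recognition models for several spoken language identification tasks, giving results competitive with those obtained using hand-labeled proprietary datasets.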