Although automatic speech recognition (ASR) can perform well in common
non-overlapping environments, sustaining performance in multi-talker
overlapping speech recognition remains challenging. Recent research revealed
that an ASR model's encoder captures different levels of information in
different layers -- the lower layers tend to have more acoustic information,
and the upper layers more linguistic. This inspires us to develop a Sidecar
separator to empower a well-trained ASR model for multi-talker scenarios by
separating the mixed speech embedding between two suitable layers. We
experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By
freezing the parameters of the original model and training only the Sidecar
(8.7 M, 8.4% of all parameters), the proposed approach outperforms the previous
state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset,
reaching a word error rate (WER) of 10.36%, and obtains comparable results
(7.56%) on the LibriSpeechMix dataset with limited training.
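
The data flow described above can be sketched with a toy example. This is a
minimal illustration, not the authors' implementation: the layer split point,
the mask-based separation, the toy dimensions, and every name below
(make_layer, Sidecar, recognize_mixture) are assumptions made for the sketch.
The backbone layers are fixed stand-ins, and no training loop is shown.

```python
import math
import random

random.seed(0)
DIM = 4  # toy embedding width (hypothetical)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_layer(dim):
    """Stand-in for one frozen encoder layer: fixed linear map plus tanh."""
    weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def layer(frames):
        return [[math.tanh(sum(w * x for w, x in zip(row, frame)))
                 for row in weights] for frame in frames]
    return layer


# Frozen backbone, split at an assumed layer index.
lower_layers = [make_layer(DIM) for _ in range(2)]  # closer to acoustics
upper_layers = [make_layer(DIM) for _ in range(2)]  # closer to linguistics


def run(layers, frames):
    for layer in layers:
        frames = layer(frames)
    return frames


class Sidecar:
    """The only part that would be trained: one soft mask per speaker."""

    def __init__(self, dim, num_speakers=2):
        self.speaker_weights = [
            [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_speakers)
        ]

    def separate(self, frames):
        streams = []
        for weights in self.speaker_weights:
            stream = []
            for frame in frames:
                # Mask values lie in (0, 1) and depend on the mixed frame.
                mask = [sigmoid(sum(w * x for w, x in zip(row, frame)))
                        for row in weights]
                stream.append([m * x for m, x in zip(mask, frame)])
            streams.append(stream)
        return streams


def recognize_mixture(frames, sidecar):
    mixed = run(lower_layers, frames)        # shared lower layers
    per_speaker = sidecar.separate(mixed)    # split the mixed embedding
    return [run(upper_layers, s) for s in per_speaker]  # shared upper layers


mixture = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
outputs = recognize_mixture(mixture, Sidecar(DIM))
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 2 3 4
```

The point of the sketch is the ordering: the mixture passes once through the
lower layers, is separated in embedding space, and each separated stream then
reuses the same upper layers.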

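This study proposes a multi-talker speech recognition method based on a Sidecar separator, aiming to improve an ASR model's recognition performance in multi-talker scenarios. Experimental results show that the method outperforms the previous state of the art.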

A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

In this paper, we present end-to-end and speech embedding based systems
trained in a self-supervised fashion to participate in the ACM Multimedia 2022
ComParE Challenge, specifically the stuttering sub-challenge. In particular, we
exploit the embeddings from the pre-trained Wav2Vec2.0 model for stuttering
detection (SD) on the KSoF dataset. After embedding extraction, we benchmark
with several methods for SD. Our proposed self-supervised SD system achieves
an unweighted average recall (UAR) of 36.9% and 41.0% on validation and test
sets respectively, a relative improvement of 31.32% (validation set) and 1.49%
(test set) over the best (DeepSpectrum) challenge baseline (CBL). Moreover, we
show that concatenating layer embeddings with Mel-frequency cepstral
coefficient (MFCC) features further improves the UAR, by 33.81% and 5.45% on
validation and test sets respectively over the CBL. Finally, we demonstrate
that summing information across all the layers of Wav2Vec2.0 surpasses the CBL
by a relative margin of 45.91% and 5.69% on validation and test sets
respectively. Grand-challenge:
Computational Paralinguistics ChallengE
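
As a rough illustration of the feature combinations and the metric mentioned
above, the sketch below sums per-layer embeddings, concatenates an embedding
with MFCC features, and computes UAR as the mean of per-class recall. The
function names and all toy values are hypothetical and do not come from the
paper; real embeddings would come from a Wav2Vec2.0 model and real MFCCs from
an audio front end.

```python
def sum_layers(layer_embeddings):
    """Element-wise sum of per-layer embedding vectors (one vector per layer)."""
    return [sum(values) for values in zip(*layer_embeddings)]


def concat_features(embedding, mfcc):
    """Concatenate an embedding vector with an MFCC feature vector."""
    return list(embedding) + list(mfcc)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def relative_gain(system, baseline):
    """Relative improvement of `system` over `baseline`, in percent."""
    return 100.0 * (system - baseline) / baseline


# Toy stand-ins for three layers of a 2-dimensional embedding.
layers = [[1.0, 2.0], [0.5, -1.0], [0.5, 1.0]]
summed = sum_layers(layers)
print(summed)                                # [2.0, 2.0]
print(concat_features(summed, [0.1, 0.2]))   # [2.0, 2.0, 0.1, 0.2]

# Imbalanced toy labels: class "a" has recall 2/3, class "b" has recall 1.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(round(unweighted_average_recall(y_true, y_pred), 3))  # 0.833

print(relative_gain(60, 50))                 # 20.0
```

On the toy labels plain accuracy would be 0.75, while UAR is about 0.833,
which shows how the metric weights the minority class equally.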

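This paper proposes a speech embedding system based on self-supervised learning. It extracts embeddings from the pre-trained Wav2Vec2.0 model and evaluates them in combination with Mel-frequency cepstral coefficient (MFCC) features, achieving good results in the Computational Paralinguistics Challenge: an improvement of 31.32% (validation set) and 1.49% (test set) over the DeepSpectrum challenge baseline. Summing the embeddings across the Wav2Vec2.0 layers further improves system performance.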

End-to-End and Self-Supervised Learning for ComParE 2022 Stuttering Sub-Challenge

Learning to recognize new keywords with just a few examples is essential for
personalizing keyword spotting (KWS) models to a user's choice of keywords.
However, modern KWS models are typically trained on large datasets and
restricted to a small vocabulary of keywords, limiting their transferability to
a broad range of unseen keywords. Towards easily customizable KWS models, we
present KeySEM (Keyword Speech EMbedding), a speech embedding model pre-trained
on the task of recognizing a large number of keywords. Speech representations
offered by KeySEM are highly effective for learning new keywords from a limited
number of examples. Comparisons with a diverse range of related work across
several datasets show that our method achieves consistently superior
performance with fewer training examples. Although KeySEM was pre-trained only
on English utterances, the performance gains also extend to datasets from four
other languages, indicating that KeySEM learns useful representations well
aligned with the task of keyword spotting. Finally, we demonstrate KeySEM's
ability to learn new keywords sequentially without requiring re-training on
previously learned keywords. Our experimental observations suggest that KeySEM
is well suited to on-device environments where post-deployment learning and
ease of customization are often desirable.
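
One simple way to picture few-shot, sequential keyword learning on top of a
frozen embedding model is a nearest-prototype classifier. The abstract does not
say that KeySEM works this way; the sketch below is an assumed stand-in
technique, and the class name, method names, and 3-dimensional toy embeddings
are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PrototypeKeywordSpotter:
    """Stores one mean embedding (prototype) per keyword."""

    def __init__(self):
        self.prototypes = {}

    def add_keyword(self, keyword, example_embeddings):
        # Only the new keyword's examples are needed; stored prototypes
        # for earlier keywords are left untouched.
        self.prototypes[keyword] = mean_vector(example_embeddings)

    def predict(self, embedding):
        return max(self.prototypes,
                   key=lambda k: cosine(self.prototypes[k], embedding))


# Toy vectors standing in for the output of a frozen embedding model.
spotter = PrototypeKeywordSpotter()
spotter.add_keyword("lights", [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
spotter.add_keyword("music", [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
print(spotter.predict([0.8, 0.2, 0.0]))  # lights

# Later, a third keyword is added from two examples, with no re-training.
spotter.add_keyword("timer", [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])
print(spotter.predict([0.1, 0.1, 0.8]))  # timer
print(spotter.predict([0.8, 0.2, 0.0]))  # lights
```

In this sketch, adding a keyword costs one averaging step, which is why such a
scheme suits post-deployment customization; how well it works depends entirely
on the quality of the underlying embeddings.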

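KeySEM is a speech-embedding-based keyword spotting model for personalized keyword recognition. It learns new keywords efficiently from a limited number of examples and improves keyword spotting performance, making it suitable for scenarios that require on-device learning and customization.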

Teaching keyword spotters to spot new keywords with limited examples

With the rise of low-power speech-enabled devices, there is a growing demand
to quickly produce models for recognizing arbitrary sets of keywords. As with
many machine learning tasks, one of the most challenging parts in the model
creation process is obtaining a sufficient amount of training data. In this
paper, we explore the effectiveness of synthesized speech data in training
small, spoken term detection models of around 400k parameters. Instead of
training such models directly on the audio or low-level features such as MFCCs,
we use a pre-trained speech embedding model trained to extract useful features
for keyword spotting models. Using this speech embedding, we show that a model
which detects 10 keywords when trained on only synthetic speech is equivalent
to a model trained on over 500 real examples. We also show that a model without
our speech embeddings would need to be trained on over 4000 real examples to
reach the same accuracy.

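This paper studies training small spoken term detection models on synthesized speech data, using a pre-trained speech embedding model to extract useful features. A model trained only on synthetic speech reaches the same accuracy as one trained on over 500 real examples.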