An important and difficult task in code-switched speech recognition is to
recognize the language, as many words in the two languages can sound similar,
especially in some accents. We focus on improving the performance of end-to-end
Automatic Speech Recognition models by conditioning transformer layers on
the language ID of words and characters in the output in a per-layer supervised
manner. To this end, we propose two methods of introducing language-specific
parameters and explainability in the multi-head attention mechanism, and
implement a Temporal Loss that helps maintain continuity in input alignment.
Despite being unable to reduce WER significantly, our method shows promise in
predicting the correct language from just spoken data. We introduce
regularization in the language prediction by dropping LID in the sequence,
which helps align long repeated output sequences.

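By conditioning transformer layers on language ID, we propose two methods of introducing language-specific parameters and explainability, together with an auxiliary temporal loss, to improve the performance of end-to-end automatic speech recognition models. Although it does not significantly reduce word error rate, our method shows promise in predicting the correct language from spoken data alone. We introduce regularization in the language prediction by dropping language IDs in the sequence, which helps align long repeated output sequences.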
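The LID-dropping regularization mentioned above can be illustrated with a minimal sketch. It assumes that language IDs appear as tag tokens interleaved with the output tokens and that each tag is dropped independently with probability `p`; the tag names `<en>` and `<gu>`, the function name `drop_lid`, and the independent-drop scheme are all hypothetical and are not taken from the abstract.

```python
import random

# Hypothetical LID tag tokens; the abstract does not name the actual tags.
LID_TOKENS = {"<en>", "<gu>"}


def drop_lid(tokens, p, rng):
    """Return a copy of `tokens` with each LID tag removed with probability `p`.

    Non-LID tokens are always kept, so only the language supervision
    signal is thinned out, not the transcript itself.
    """
    return [t for t in tokens if t not in LID_TOKENS or rng.random() >= p]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    target = ["<gu>", "kem", "<gu>", "cho", "<en>", "hello", "<en>", "world"]
    print(drop_lid(target, p=0.3, rng=rng))
```

With `p=0` the target sequence is unchanged, and with `p=1` every LID tag is removed; intermediate values leave a random subset of tags in place.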

Code-Switched Speech Recognition Using a Language Model: Gujarati-English

Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language

We investigate the emergent abilities of the recently proposed web-scale
speech model Whisper by adapting it to unseen tasks with prompt engineering.
We selected three tasks: audio-visual speech recognition (AVSR), code-switched
speech recognition (CS-ASR), and speech translation (ST) on unseen language
pairs. We design task-specific prompts by either leveraging another
large-scale model or simply manipulating the special tokens in the default
prompts. Experiments show that, compared to the default prompts, our proposed
prompts improve performance by 10% to 45% on the three zero-shot tasks, and
even outperform SotA supervised models on some datasets. In addition, our
experiments reveal many interesting properties of Whisper, including its
robustness to prompts, bias on accents, and the multilingual understanding in
its latent space. Code is available at
this https URL

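This paper explores the capabilities of the Whisper model on three tasks (audio-visual speech recognition, code-switched speech recognition, and speech translation) by adapting its prompts. Experiments show that, compared to the default prompts, the proposed prompts improve performance by 10% to 45% on the zero-shot tasks and even outperform SotA supervised models on some datasets. The experiments also reveal many interesting properties of Whisper, such as its robustness to prompts, its bias on accents, and the multilingual understanding in its latent space.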
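To make "manipulating the special tokens in the default prompts" concrete, the sketch below assembles a Whisper-style decoder prompt as a plain string. The token spellings (`<|startoftranscript|>`, `<|en|>`, `<|transcribe|>`, `<|notimestamps|>`) follow Whisper's published token format. The helper `build_prompt` is hypothetical, and placing two language tokens side by side for code-switched input is shown only as one possible manipulation, an assumption rather than something the abstract states.

```python
def build_prompt(languages, task="transcribe", timestamps=False):
    """Assemble a Whisper-style special-token prompt as a string.

    `languages` is a list of language codes; the default prompt uses
    exactly one, while passing two is an illustrative (assumed) variant
    for code-switched speech.
    """
    parts = ["<|startoftranscript|>"]
    parts += [f"<|{lang}|>" for lang in languages]
    parts.append(f"<|{task}|>")
    if not timestamps:
        parts.append("<|notimestamps|>")
    return "".join(parts)


if __name__ == "__main__":
    # Default-style prompt with a single language token.
    print(build_prompt(["en"]))
    # Assumed code-switching variant with two language tokens.
    print(build_prompt(["zh", "en"]))
```

The string form is only for inspection; feeding such a prompt to a model would require mapping each special token to its ID with the model's own tokenizer.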