Self-supervised learning (SSL) speech representation models, trained on large
speech corpora, have demonstrated effectiveness in extracting hierarchical
speech embeddings through multiple transformer layers. However, the behavior of
these embeddings in specific tasks remains uncertain. This paper investigates
the multi-layer behavior of the WavLM model in anti-spoofing and proposes an
attentive merging method to leverage the hierarchical hidden embeddings.
Results demonstrate the feasibility of fine-tuning WavLM to achieve the best
equal error rate (EER) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA,
2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the
early hidden transformer layers of the WavLM large model contribute
significantly to the anti-spoofing task, so computation can be reduced by
using only part of the pre-trained model.
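
The idea of merging hidden embeddings from several transformer layers can be illustrated with a softmax-weighted sum over layers. This is only a minimal sketch under stated assumptions, not the paper's architecture: the abstract does not specify how the attention weights are computed, so `layer_scores` below is a hypothetical stand-in for the output of a learned attention module, and each layer is assumed to have already been reduced to a single vector (for example by averaging over time).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_layers(layer_embeddings, layer_scores):
    """Combine per-layer embeddings into one vector.

    layer_embeddings: list of L vectors, one per transformer layer, each of
        dimension D (assumed to be pooled over time beforehand).
    layer_scores: list of L unnormalized scores; hypothetical stand-ins for
        what a learned attention module would produce.
    Returns the merged vector and the normalized layer weights.
    """
    if len(layer_embeddings) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_embeddings[0])
    merged = [0.0] * dim
    for weight, embedding in zip(weights, layer_embeddings):
        for i in range(dim):
            merged[i] += weight * embedding[i]
    return merged, weights


# Toy input: 3 layers with 2-dimensional embeddings and equal scores.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged, weights = merge_layers(layers, [0.0, 0.0, 0.0])
```

Inspecting `weights` after training is one way to see which layers a model relies on. If the later layers receive little weight, they could in principle be dropped, which is the kind of saving the abstract describes.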

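This paper studies the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to exploit its hierarchical hidden embeddings. Fine-tuning WavLM achieves best equal error rates of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, so using only part of the pre-trained model reduces computation.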
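
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a rough sketch of how it can be approximated from detector scores; it assumes that a higher score means "more likely bona fide", and the score lists are made-up toy values rather than ASVspoof results. Official evaluation toolkits may interpolate between thresholds, which this sketch does not do.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumed convention: higher score = more likely bona fide, and a trial
    is accepted when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one spoofed trial outscores one bona fide trial.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(bonafide, spoof)
```

On this toy data the two error rates meet at a threshold of 0.6, where one of four bona fide trials is rejected and one of four spoofed trials is accepted, giving an EER of 0.25.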

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection

Recent work on speech representation models jointly pre-trained with text has
demonstrated the potential of improving speech representations by encoding
speech and text in a shared space. In this paper, we leverage such shared
representations to address the persistent challenge of limited data
availability in spoken language understanding tasks. By employing a pre-trained
speech-text model, we find that models fine-tuned on text can be effectively
transferred to speech testing data. With as little as 1 hour of labeled speech
data, our proposed approach achieves comparable performance on spoken language
understanding tasks (specifically, sentiment analysis and named entity
recognition) when compared to previous methods using speech-only pre-trained
models fine-tuned on 10 times more data. Beyond the proof-of-concept study, we
also analyze the latent representations. We find that the bottom layers of
speech-text models are largely task-agnostic and align speech and text
representations into a shared space, while the top layers are more
task-specific.
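
The transfer setting described above, in which a model is trained on text and then applied to speech, can be pictured with a toy example. The sketch below swaps the paper's fine-tuned model for a simple nearest-centroid classifier and uses made-up two-dimensional vectors. It assumes only that text and speech embeddings come from the same encoder and therefore lie in one shared space, so every name and number here is hypothetical.

```python
import math


def fit_centroids(vectors, labels):
    """Average the training vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Hypothetical embeddings: in practice both sets would be produced by the
# same pre-trained speech-text encoder; these numbers are invented.
text_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
text_labels = ["positive", "positive", "negative", "negative"]
speech_test = [[0.85, 0.25], [0.15, 0.7]]

# Train on text only, then classify speech embeddings with the same model.
centroids = fit_centroids(text_train, text_labels)
speech_predictions = [predict(centroids, v) for v in speech_test]
```

The classifier never sees a speech example during training. It works on speech only to the extent that the encoder places matching text and speech inputs close together, which is the property the abstract attributes to the bottom layers.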

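Using a pre-trained speech-text model, this study finds that as little as 1 hour of labeled speech data is enough to match, on spoken language understanding tasks (sentiment analysis and named entity recognition), speech-only pre-trained models fine-tuned on 10 times more data. It also finds that the bottom layers of speech-text models are largely task-agnostic and align speech and text representations in a shared space, while the top layers are more task-specific.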

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

Recently proposed self-supervised learning approaches have been successful
for pre-training speech representation models. The utility of these learned
representations has been observed empirically, but little is known about the
type or extent of information encoded in the pre-trained representations
themselves. Developing such insights can help us understand the capabilities
and limits of these models and enable the research community to use them more
efficiently in downstream applications. In this work,
we begin to fill this gap by examining one recent and successful pre-trained
model (wav2vec 2.0), via its intermediate representation vectors, using a suite
of analysis tools. We use the metrics of canonical correlation, mutual
information, and performance on simple downstream tasks with non-parametric
probes, in order to (i) query for acoustic and linguistic information content,
(ii) characterize the evolution of information across model layers, and (iii)
understand how fine-tuning the model for automatic speech recognition (ASR)
affects these observations. Our findings motivate modifying the fine-tuning
protocol for ASR, which produces improved word error rates in a low-resource
setting.
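
Of the metrics named above, mutual information is the simplest to show in a few lines. The sketch below estimates it between two discrete sequences. It assumes the continuous representation vectors have already been discretized (for example by clustering), and `cluster_ids` and `phone_labels` are hypothetical toy data, not values from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # Each term is p(x, y) * log2( p(x, y) / (p(x) * p(y)) ).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data: clusters 0 and 1 each map to one phone, cluster 2 is ambiguous.
cluster_ids = [0, 0, 1, 1, 2, 2]
phone_labels = ["a", "a", "b", "b", "a", "b"]
mi_bits = mutual_information(cluster_ids, phone_labels)
```

Computing this quantity separately for each layer's representations gives one curve per label type, which is one way to characterize how information evolves across layers.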

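This study uses a suite of analysis tools to examine a recent pre-trained speech representation model, wav2vec 2.0, through its intermediate representation vectors. It characterizes the acoustic and linguistic information these vectors encode and how fine-tuning for automatic speech recognition (ASR) affects these observations. The findings motivate a modified fine-tuning protocol that improves word error rates in a low-resource setting.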