Accurately detecting dysfluencies in spoken language can help to improve the
performance of automatic speech and language processing components and support
the development of more inclusive speech and language technologies. Inspired by
the recent trend towards the deployment of large language models (LLMs) as
universal learners and processors of non-lexical inputs, such as audio and
video, we approach the task of multi-label dysfluency detection as a language
modeling problem. We present hypothesis candidates generated with an automatic
speech recognition system and acoustic representations extracted from an audio
encoder model to an LLM, and fine-tune the system to predict dysfluency labels
on three datasets containing English and German stuttered speech. The
experimental results show that our system effectively combines acoustic and
lexical information and achieves competitive results on the multi-label
stuttering detection task.

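By feeding hypothesis candidates generated by an automatic speech recognition system and acoustic representations extracted from an audio encoder model into a large language model (LLM), we treat multi-label dysfluency detection as a language modeling problem and fine-tune the system to predict dysfluency labels on three datasets of English and German stuttered speech. The experimental results show that the system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.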

Large Language Models for Dysfluency Detection in Stuttered Speech

The use of speech models for automatic speech processing tasks can improve
efficiency in screening, analysis, diagnosis and treatment in medicine and
psychiatry. However, the performance of speech pre-processing tasks such as
segmentation and diarization can drop considerably on in-the-wild clinical
data, particularly when the target dataset comprises atypical speech. In
this paper we study the performance of a pre-trained speech model on a dataset
comprising child-clinician conversations in Danish with respect to the
classification threshold. Since we do not have access to sufficient labelled
data, we propose few-instance threshold adaptation, wherein we employ the first
minutes of the speech conversation to obtain the optimum classification
threshold. We find that the model with the default classification threshold
performs worse on children from the patient group. Furthermore, the error
rates of the model are directly correlated with the severity of diagnosis in
the patients. Lastly, our study on few-instance adaptation shows that three
minutes of clinician-child conversation are sufficient to obtain the optimum
classification threshold.
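
The few-instance threshold adaptation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the speech model emits one speech probability per frame, that reference labels exist for the first minutes of the conversation, and that the threshold is chosen by minimising the frame-level error (misses plus false alarms). The function names, the 100 Hz frame rate and the candidate grid are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def error_rate(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of frames misclassified at this threshold (misses + false alarms)."""
    errors = sum((s >= threshold) != bool(y) for s, y in zip(scores, labels))
    return errors / len(labels)


def adapt_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    frame_rate_hz: int = 100,        # assumed frame rate, not from the paper
    adapt_minutes: float = 3.0,      # the abstract reports three minutes suffice
    candidates: Optional[Sequence[float]] = None,
) -> float:
    """Pick the threshold that minimises the error on the first minutes only."""
    n = min(len(scores), int(adapt_minutes * 60 * frame_rate_hz))
    if n == 0:
        raise ValueError("no frames available for adaptation")
    if candidates is None:
        # Hypothetical search grid: 0.01, 0.02, ..., 0.99.
        candidates = [i / 100 for i in range(1, 100)]
    head_scores, head_labels = scores[:n], labels[:n]
    # min() returns the first candidate with the lowest error on the head segment.
    return min(candidates, key=lambda t: error_rate(head_scores, head_labels, t))
```

In use, the adapted threshold would be estimated once on the opening segment and then applied to the rest of the same conversation, in place of the model's default threshold.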

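This paper studies segmentation and diarization with a pre-trained speech model on in-the-wild clinical data and proposes few-instance threshold adaptation. The model with the default classification threshold performs worse on the patient group, its error rates are directly correlated with the severity of the patients' diagnoses, and three minutes of clinician-child conversation are sufficient to obtain the optimum classification threshold.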

Speech Detection For Child-Clinician Conversations In Danish For Low-Resource In-The-Wild Conditions: A Case Study

Recent years have seen an increasing number of studies on the design of
computer-assisted interpreting (CAI) tools with integrated automatic speech
processing and their use by trainees and professional interpreters. This paper
discusses the role of system latency in such tools and presents the results of
an experiment designed to investigate the maximum system latency that is
cognitively acceptable for interpreters working in the simultaneous modality.
The results show that interpreters can cope with a system latency of 3 seconds
without any major impact on the rendition of the original text, in terms of
both accuracy and fluency. This value is above the typical latency of available
AI-based CAI tools and paves the way for experiments with larger context-based
language models and higher latencies.

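This study examines the cognitive impact of system latency in computer-assisted interpreting tools on interpreters. The results show that interpreters can cope with a latency of 3 seconds in simultaneous interpreting, a value above the typical latency of currently available AI-based tools, which opens the way to studying context-based language models with higher latencies.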