Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.

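This paper addresses two main challenges in speech event detection: continually integrating new events without forgetting previous ones, and disentangling semantic events from acoustic events. To address these challenges, the authors propose a new task, continual event detection from speech, and provide two benchmark datasets. Their "Double Mixture" method combines speech expertise with robust memory mechanisms to improve adaptability and prevent forgetting. Experiments show that the task remains highly challenging for current state-of-the-art methods from computer vision and natural language processing. The proposed method achieves the lowest forgetting rate and the highest level of generalization across various continual learning sequences, demonstrating its robustness. The related code and data are available online.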

Double Mixture: Towards Continual Event Detection from Speech