In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
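
The decompose, filter, and reassemble flow described above can be pictured with a toy example. This is only a sketch under strong simplifying assumptions: the semantic filter is reduced here to one scalar gain per source, the sources are assumed to be already separated, and the function name `apply_semantic_filter`, the sample values, and the gains are hypothetical and do not come from the LCE system itself.

```python
# Illustrative sketch only. Treating the "semantic filter" as a per-source gain
# vector is an assumption made for clarity, not the authors' implementation.

def apply_semantic_filter(sources, gains):
    """Scale each separated source by its gain and sum them into one signal."""
    if len(sources) != len(gains):
        raise ValueError("need exactly one gain per source")
    length = len(sources[0])
    if any(len(s) != length for s in sources):
        raise ValueError("all sources must have the same number of samples")
    # Reassemble the mixture: weighted sum over sources at every time step.
    return [sum(g * s[t] for g, s in zip(gains, sources)) for t in range(length)]


# Toy stand-ins for the components a separator would produce (hypothetical values).
speech = [0.5, -0.5, 0.25, 0.0]
dog_bark = [0.25, 0.25, -0.25, 0.5]
music = [0.0, 0.5, 0.5, -0.25]

# Hypothetical gains for a prompt such as
# "remove the dog barking and turn the music down":
# keep speech (1.0), remove the bark (0.0), halve the music (0.5).
gains = [1.0, 0.0, 0.5]

edited = apply_semantic_filter([speech, dog_bark, music], gains)
```

In this simplified view, extraction, removal, and volume control differ only in the gains chosen: a gain of 1 keeps a source, 0 removes it, and any other value changes its volume.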

Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience