Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models ``forgetting'' capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72\% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using $>$100x less training compute.

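This work addresses the information loss and added complexity that arise when existing voice assistants model audio and text separately. We propose a new training approach that uses the responses of a text-only model as self-supervision, which removes the need for annotated responses. Our results show that the Distilled Voice Assistant (DiVA) performs well on question answering, classification, and translation, and that it surpasses existing state-of-the-art models on user preference, indicating substantial potential impact.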

Distilling an End-to-End Voice Assistant Without Instruction Training Data