TL;DR提出一种名为 LongFNT 的架构,通过融合句子级别和标记级别的长时序特征和预训练的 RoBERTa 上下文编码器,扩展了长段音频输入的自动语音识别模型,显著降低了字错率。
Abstract
Traditional automatic speech recognition~(ASR) systems usually focus on
individual utterances, without considering long-form speech with useful
historical information, which is more practical in real scenarios. S