End-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still struggle to recognize multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that the proposed method outperforms the baseline, with relative reductions of 22.1$\%$ and 17.2$\%$ in character error rate (CER) on the multi-accent test sets of KeSpeech and MagicData-RMAC, respectively.
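
To make the dynamic chunk idea concrete, the following is a minimal sketch of a chunk-based attention mask of the kind commonly used for streaming E2E ASR. It is an illustration under stated assumptions, not the Qifusion-Net implementation: the function names, the `num_left_chunks` parameter, and the chunk-size sampling rule are all hypothetical choices made for this example.

```python
import random


def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Each frame sees its own chunk plus earlier chunks, never future chunks.
    num_left_chunks < 0 means unlimited left context (an assumption of
    this sketch, not a detail taken from the abstract).
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Right edge: the end of the current chunk, so no look-ahead beyond it.
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        # Left edge: the start of the sequence or a limited number of chunks back.
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Hypothetical 'dynamic' rule: pick a new chunk size for every utterance.

    Roughly half of the time the whole utterance is used (non-streaming),
    otherwise a small chunk size is used (streaming).
    """
    size = rng.randint(1, num_frames)
    if size > num_frames // 2:
        return num_frames  # full attention over the utterance
    return size % max_chunk + 1  # small chunk for streaming-style training


if __name__ == "__main__":
    for row in chunk_attention_mask(num_frames=6, chunk_size=2):
        print("".join("x" if visible else "." for visible in row))
```

Training with a chunk size that varies per utterance is what would let a single model be decoded with either a small chunk (streaming) or the full utterance (non-streaming); the mask itself only controls which frames each position can see.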

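By proposing a layer-adapted fusion model named Qifusion-Net, we can effectively recognize multi-accent speech without any prior knowledge of the target accent. A dynamic chunk strategy enables streaming decoding and the extraction of frame-level acoustic features, which facilitates fine-grained information fusion. Experimental results show that, relative to the baseline, our method reduces the character error rate (CER) by 22.1% and 17.2% on the multi-accent test sets of KeSpeech and MagicData-RMAC, respectively.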

Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition