The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm, which allows us to retain some length information; 4) new activation functions SwooshR and SwooshL, which work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. ScaledAdam achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
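The idea of scaling the update by each tensor's current scale can be illustrated with a minimal sketch. It is not the implementation in the released code: it assumes that "scale" means the root-mean-square (RMS) of the tensor, it treats a tensor as a flat Python list, and it leaves out the part of ScaledAdam that learns the parameter scale. The names `rms`, `init_state`, `scaled_step`, and the `min_rms` floor are hypothetical and chosen only for illustration.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def init_state(n):
    """Hypothetical optimizer state: step count and Adam moment estimates."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def scaled_step(param, grad, state, lr=0.05, beta1=0.9, beta2=0.98,
                eps=1e-8, min_rms=1e-5):
    """One Adam-style step whose size is multiplied by the tensor's RMS.

    Assumption: the tensor's "current scale" is its RMS, floored at
    min_rms so that an all-zero tensor can still move.
    """
    state["t"] += 1
    t = state["t"]
    scale = max(rms(param), min_rms)
    new_param = []
    for i, (p, g) in enumerate(zip(param, grad)):
        # Standard Adam moment estimates with bias correction.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        direction = m_hat / (math.sqrt(v_hat) + eps)
        # Scaling by the tensor's RMS keeps the relative change similar
        # for small and large tensors.
        new_param.append(p - lr * scale * direction)
    return new_param
```

In this sketch, two tensors that differ only by a constant factor and receive the same gradient change by the same relative amount on the first step (approximately `lr`), whereas plain Adam would apply the same absolute change to both.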

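We propose Zipformer, a faster, more memory-efficient, and better-performing transformer. It uses a U-Net-like encoder structure whose middle stacks operate at lower frame rates, a reorganized block structure for efficiency, BiasNorm (a modified form of LayerNorm) to retain some length information, and the new activation functions SwooshR and SwooshL, which outperform Swish. An optimizer called ScaledAdam scales each update so that the relative change stays about the same, and explicitly learns the parameter scale. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of Zipformer compared with other state-of-the-art ASR models.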

Zipformer: A faster and better encoder for automatic speech recognition