In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with a monotonic RNN-T loss well-suited to frame-synchronous, streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model achieves competitive performance compared to existing LibriSpeech benchmarks for attention-based models trained with cross-entropy loss. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
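To make two ideas from the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows (a) a self-attention mask that restricts each frame to a limited number of past frames and, optionally, a few future frames, and (b) a joint layer that combines audio and label encoder outputs into a distribution over labels for every (frame, label history) pair. The function names, the additive combination with a `tanh` nonlinearity, and all dimensions are assumptions chosen for illustration; the abstract only states that the activations are combined with a feed-forward layer.

```python
import numpy as np


def limited_context_mask(num_frames, left, right=0):
    """Return a boolean mask where mask[t, s] is True if frame t may attend to frame s.

    Frame t sees frames t-left .. t+right (hypothetical helper for illustration).
    """
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # offset[t, s] = s - t
    return (offset >= -left) & (offset <= right)


def joint_log_probs(audio_enc, label_enc, w_audio, w_label, w_out):
    """Combine encoder outputs into log P(label | frame t, label history u).

    audio_enc: (T, Da) audio encoder activations
    label_enc: (U, Dl) label encoder activations
    Returns an array of shape (T, U, V).
    Assumption: additive combination followed by tanh and a linear output layer.
    """
    # Project both encodings to a shared hidden size and add them for every (t, u) pair.
    hidden = np.tanh(
        (audio_enc @ w_audio)[:, None, :] + (label_enc @ w_label)[None, :, :]
    )
    logits = hidden @ w_out  # (T, U, V)
    # Numerically stable log-softmax over the label dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


# Toy usage with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, U, Da, Dl, H, V = 6, 4, 8, 8, 16, 5
log_probs = joint_log_probs(
    rng.normal(size=(T, Da)),
    rng.normal(size=(U, Dl)),
    rng.normal(size=(Da, H)),
    rng.normal(size=(Dl, H)),
    rng.normal(size=(H, V)),
)
streaming_mask = limited_context_mask(T, left=2)            # past frames only
lookahead_mask = limited_context_mask(T, left=2, right=1)   # plus one future frame
```

With `right=0` each frame depends only on a bounded window of past frames, which is what keeps per-frame computation constant during streaming; a small `right` value corresponds to attending to a limited number of future frames, at the cost of added latency.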

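This paper proposes an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Experiments on the LibriSpeech dataset show that limiting the left context of self-attention in the Transformer layers makes streaming decoding tractable with only a slight degradation in accuracy, and that the full attention version of the model is competitive with existing LibriSpeech benchmarks.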

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss