Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Sencondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support the online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as $0.19\%$ absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.

本论文介绍了基于Transformer的在线CTC/Attention E2E ASR架构，该架构包括块自注意力编码器和基于单调截断注意力的自注意力解码器，通过将块自注意力编码器和基于单调截断注意力的自注意力解码器集成到在线CTC/Attention架构中，实现了在线语音识别，与离线基线相比，具有最低为0.19％的CER衰减和显着的性能提升。

基于 Transformer 的 CTC/注意力机制在线端到端语音识别架构