While significant improvements have been made in recent years in terms of end-to-end automatic speech recognition (ASR) performance, such improvements were obtained through the use of very large neural networks, unfit for embedded use on edge devices. That being said, in this paper, we work on simplifying and compressing Transformer-based encoder-decoder architectures for the end-to-end ASR task. We empirically introduce a more compact Speech-Transformer by investigating the impact of discarding particular modules on the performance of the model. Moreover, we evaluate reducing the numerical precision of our network's weights and activations while maintaining the performance of the full-precision model. Our experiments show that we can reduce the number of parameters of the full-precision model and then further compress the model 4x by fully quantizing to 8-bit fixed point precision.

本研究通过探究去除特定模块的影响以及减少神经网络的数字精度的方法，成功地简化和压缩了基于Transformer编码器-解码器的端到端语音识别架构，实验结果表明，我们能够通过将数字精度减少到8位定点精度，将全精度模型的参数数量减小并将模型进一步压缩4倍，同时维持模型高精度。

一种简化了的全量化Transformer用于端到端语音识别