Transformers gain popularity because of their superior prediction accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing work to accelerate transformer inferences has limitations because of the changes to transformer architectures or the need for specialized hardware. In this paper, we identify the opportunities of using memoization to accelerate the attention mechanism in transformers without the above limitation. Built upon a unique observation that there is a rich similarity in attention computation across inference sequences, we build an attention database upon the emerging big memory system. We introduce the embedding technique to find semantically similar inputs to identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copy and unnecessary overhead. We enable 21% performance improvement on average (up to 68%) with the TB-scale attention database and with ignorable loss in inference accuracy.

本研究介绍一种基于缓存优化技术的变压器模型加速方案，通过建立基于大内存系统的注意力数据库来加速注意力计算，从而实现了平均21％的性能提升（最高68％），并且在推理准确性上有可忽略的损失。

大内存系统上的记忆化加速Transformer