TL;DR: This paper introduces a new method that exploits Heavy Hitters to manage the KV cache, improving the runtime performance of large language models (LLMs) on long-sequence generation tasks.
Abstract
Large language models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, refe