Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang
TL;DR: Training large language models to select their own attention spans speeds up autoregressive inference on real-world tasks.
Abstract
Large language models (LLMs) can solve challenging tasks. However, their
inference computation on modern GPUs is highly inefficient due to the
increasing number of tokens they must attend to as they generate new