Inference on large language models is expensive in both compute and memory,
especially at long sequence lengths. In
particular, the self-attention mechanism used in such models contributes
significantly to these costs, which has resulted in several recent works that
propose sparse attention approximations for inference. In this work, we propose
to approximate the self-attention computation by focusing on the dimensionality
of key vectors computed in the attention block. Our analysis reveals that the
key vectors lie in a significantly lower-dimensional space, consistently across
several datasets and models. Exploiting this observation, we propose Loki, a
novel sparse attention method that ranks and selects tokens in the KV-cache
based on attention scores computed in low-dimensional space. Our evaluations
show that Loki is able to maintain the efficacy of the models better than other
popular approximation methods, while speeding up the attention computation due
to reduced data movement (load/store) and compute costs.
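
The idea described above (rank cached tokens using scores computed in a low-dimensional space, then attend only to the selected ones) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `low_rank_basis` and `low_rank_topk_attention`, the use of an uncentered SVD to obtain the basis, the single-query setting, and the parameters `d_low` and `k` are all assumptions made for illustration.

```python
import numpy as np

def low_rank_basis(keys, d_low):
    """Top d_low right-singular directions of the keys, shape (D, d_low).

    Assumption: an uncentered SVD over calibration keys stands in for
    whatever dimensionality analysis the paper actually uses.
    """
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    return vt[:d_low].T

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def low_rank_topk_attention(q, K, V, basis, k):
    """Single-query attention over the k tokens ranked highest in low-dim space."""
    d = K.shape[1]
    # Approximate scores from the projected keys and projected query only.
    approx_scores = (K @ basis) @ (basis.T @ q)
    # Indices of the k largest approximate scores (in no particular order).
    top = np.argpartition(approx_scores, -k)[-k:]
    # Exact scaled dot-product attention, restricted to the selected tokens.
    weights = softmax(K[top] @ q / np.sqrt(d))
    return weights @ V[top], top

# Hypothetical usage with random data: 32 cached tokens, head dimension 8.
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 8))
q = rng.normal(size=8)
basis = low_rank_basis(K, d_low=4)
out, selected = low_rank_topk_attention(q, K, V, basis, k=8)
```

In this sketch the projected keys `K @ basis` are recomputed on every call for brevity; any saving in data movement would depend on storing the keys in projected form so that the ranking step reads only `d_low` values per token.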

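We propose Loki, a sparse attention method that computes attention scores in a low-dimensional space. At inference time it better preserves model efficacy and speeds up attention computation by reducing data movement and compute costs.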

Loki: Low-Rank Keys for Efficient Sparse Attention

Imitation learning (IL) consists of a set of tools that leverage expert
demonstrations to quickly learn policies. However, if the expert is suboptimal,
IL can yield policies with inferior performance compared to reinforcement
learning (RL). In this paper, we aim to provide an algorithm that combines the
best aspects of RL and IL. We accomplish this by formulating several popular RL
and IL algorithms in a common mirror descent framework, showing that these
algorithms can be viewed as variations on a single approach. We then propose
LOKI, a strategy for policy learning that first performs a small but random
number of IL iterations before switching to a policy gradient RL method. We
show that if the switching time is properly randomized, LOKI can learn to
outperform a suboptimal expert and converge faster than running policy gradient
from scratch. Finally, we evaluate the performance of LOKI experimentally in
several simulated environments.
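
The switching strategy described above can be sketched as a short training loop. This is only an illustration under stated assumptions: `imitate_then_reinforce`, `il_step`, `rl_step`, `min_il`, and `max_il` are hypothetical names, the update functions are placeholders supplied by the caller, and the switching time is drawn uniformly here, which may differ from the distribution the paper analyzes.

```python
import random

def imitate_then_reinforce(policy, il_step, rl_step, total_iters,
                           min_il, max_il, rng=None):
    """Run a random number of IL updates, then switch to RL updates.

    il_step and rl_step are caller-supplied functions mapping a policy to an
    updated policy (placeholders for an imitation update and a policy
    gradient update).
    """
    rng = rng or random.Random()
    # Assumption: uniform draw over [min_il, max_il], both ends inclusive.
    switch_at = rng.randint(min_il, max_il)
    for t in range(total_iters):
        step = il_step if t < switch_at else rl_step
        policy = step(policy)
    return policy, switch_at

# Hypothetical usage: the "policy" is just a log of which update was applied.
log, switch_at = imitate_then_reinforce(
    policy=[],
    il_step=lambda p: p + ["IL"],
    rl_step=lambda p: p + ["RL"],
    total_iters=10,
    min_il=2,
    max_il=5,
    rng=random.Random(0),
)
```

Keeping the two update rules behind the same policy-to-policy interface mirrors the abstract's point that the IL and RL algorithms are variations on a single approach, so switching between them only changes which update is applied.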

This paper unifies several RL and IL algorithms under a single mirror descent framework and proposes LOKI, a policy learning strategy that combines IL and RL and can outperform a suboptimal expert.