Transformer-based language models spread FLOPs uniformly across input
sequences. In this work we demonstrate that transformers can instead learn to
dynamically allocate FLOPs (or compute) to specific positions in a sequence,
optimising the allocation along the sequence for different layers across the
model depth. Our method enforces a total compute budget by capping the number
of tokens ($k$) that can participate in the self-attention and MLP computations
at a given layer. The tokens to be processed are determined by the network
using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple
procedure uses a static computation graph with known tensor sizes, unlike other
conditional computation techniques. Nevertheless, since the identities of the
$k$ tokens are fluid, this method can expend FLOPs non-uniformly across the
time and model depth dimensions. Thus, compute expenditure is entirely
predictable in sum total, but dynamic and context-sensitive at the token level.
Not only do models trained in this way learn to dynamically allocate compute,
they do so efficiently. These models match baseline performance for equivalent
training FLOPs and wall-clock time, but require a fraction of the FLOPs per
forward pass, and can be upwards of 50\% faster to step during post-training
sampling.
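
As a rough illustration of the top-$k$ routing described above, the following
NumPy sketch processes only the $k$ highest-scoring tokens of a sequence and
lets the remaining tokens skip the block unchanged. The names `route_top_k`,
`router_w` and `block_fn`, the linear scoring rule, and the score-weighted
residual update are all assumptions made for this illustration; none of them is
specified in the text above, and the sketch should not be read as the exact
implementation.

```python
import numpy as np


def route_top_k(x, router_w, block_fn, k):
    """Apply block_fn to the k highest-scoring tokens only (illustrative sketch).

    x:        (seq_len, d_model) token representations
    router_w: (d_model,) weights of a hypothetical linear router
    block_fn: maps (k, d_model) -> (k, d_model); stands in for attention + MLP
    k:        number of tokens allowed through the block, fixed ahead of time
    """
    seq_len = x.shape[0]
    if not 1 <= k <= seq_len:
        raise ValueError("k must satisfy 1 <= k <= seq_len")

    # One scalar score per token (assumed linear router).
    scores = x @ router_w

    # Indices of the k largest scores, restored to sequence order.
    top_idx = np.sort(np.argpartition(scores, -k)[-k:])

    # Tokens that are not selected pass through on the residual path untouched.
    out = x.copy()

    # block_fn always sees exactly k tokens, so tensor shapes are static.
    # Scaling the update by the score is an assumption of this sketch.
    out[top_idx] = x[top_idx] + scores[top_idx, None] * block_fn(x[top_idx])
    return out, top_idx
```

Because `block_fn` always receives exactly `k` rows, its cost is fixed in
advance regardless of which tokens are chosen, which is the sense in which total
compute is predictable while the per-token allocation depends on context.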

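By dynamically allocating compute to specific positions in a sequence, this work optimises the allocation across layers at different model depths, giving flexible allocation of compute together with predictable control over it. While holding the total compute budget fixed, the method spends compute non-uniformly and efficiently across the time and model depth dimensions; at performance matching the baseline, it substantially reduces the compute required per forward pass and speeds up post-training sampling.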