Transformer-based large language models (LLMs) have been applied to many
fields due to their remarkable utility, but this comes at a considerable
computational cost at deployment. Fortunately, some methods such as pruning or
constructing a mixture of experts (MoE) aim to exploit sparsity in transformer
feedforward (FF) blocks to increase speed and reduce memory requirements.
However, these techniques can be
very costly and inflexible in practice, as they often require training or are
restricted to specific types of architectures. To address this, we introduce
GRIFFIN, a novel training-free MoE that selects unique FF experts at the
sequence level for efficient generation across a plethora of LLMs with
different non-ReLU activation functions. This is possible due to a critical
observation that many trained LLMs naturally produce highly structured FF
activation patterns within a sequence, which we call flocking. Despite our
method's simplicity, we show that with 50\% of the FF parameters, GRIFFIN maintains
the original model's performance with little to no degradation on a variety of
classification and generation tasks, all while improving latency (e.g.
1.25$\times$ speed-up in Llama 2 13B on an NVIDIA L40). Code will be available
at this https URL

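GRIFFIN is a novel training-free MoE method that selects unique feedforward (FF) experts at the sequence level for efficient generation across large language models (LLMs) with different non-ReLU activation functions.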
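
The sequence-level selection described above can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so this example assumes one: normalize each prompt token's FF activations, score every FF neuron by its aggregate magnitude over the prompt, and keep the top fraction for the rest of the sequence. The function names (`select_experts`, `gated_ff`), the SiLU-gated FF layout, and the random weights are hypothetical and only serve the illustration.

```python
import numpy as np

def silu(x):
    # A common non-ReLU activation, used here only as an example.
    return x / (1.0 + np.exp(-x))

def select_experts(prompt_acts, keep_fraction=0.5):
    """Choose FF neurons to keep for a whole sequence.

    prompt_acts: (num_prompt_tokens, d_ff) FF hidden activations from the prompt.
    Returns the sorted indices of the kept neurons.
    """
    # Assumed scoring rule: give every token equal weight by normalizing rows.
    norms = np.linalg.norm(prompt_acts, axis=1, keepdims=True)
    relative = prompt_acts / np.maximum(norms, 1e-12)
    # Score each neuron by its aggregate magnitude across the prompt tokens.
    scores = np.linalg.norm(relative, axis=0)
    k = max(1, int(keep_fraction * prompt_acts.shape[1]))
    return np.sort(np.argsort(-scores)[:k])

def gated_ff(x, w_gate, w_up, w_down, keep=None):
    """Gated FF block; if `keep` is given, only those neurons are used."""
    if keep is not None:
        w_gate, w_up, w_down = w_gate[:, keep], w_up[:, keep], w_down[keep, :]
    hidden = silu(x @ w_gate) * (x @ w_up)
    return hidden @ w_down, hidden

# Toy usage with random (hypothetical) weights.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

prompt = rng.normal(size=(5, d_model))
_, prompt_hidden = gated_ff(prompt, w_gate, w_up, w_down)   # dense pass on the prompt
keep = select_experts(prompt_hidden, keep_fraction=0.5)     # chosen once per sequence

new_token = rng.normal(size=(1, d_model))
out, _ = gated_ff(new_token, w_gate, w_up, w_down, keep)    # pruned pass for generation
```

Because the kept neurons are fixed once per sequence, each generation step in this sketch multiplies by weight slices half the original size, and no training or fine-tuning is involved.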

Prompt-prompted Mixture of Experts for Efficient LLM Generation

Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.

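Hawk, an RNN built on gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention, respectively exceed Mamba and match Llama-2 on downstream tasks. They match the hardware efficiency of Transformers during training, offer lower latency and higher throughput during inference, and can be sharded for efficient distributed training.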
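
As a rough illustration of what a gated linear recurrence looks like, the sketch below uses a generic form, h_t = a_t * h_{t-1} + (1 - a_t) * x_t, where the gate a_t depends only on the current input. This form is an assumption: the abstract does not specify the recurrence used in Hawk and Griffin, and the real layer may differ. The names `gated_linear_recurrence`, `w_a`, and `b_a` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(xs, w_a, b_a, h0=None):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * x_t over a sequence.

    xs: (seq_len, d) inputs; w_a: (d, d) and b_a: (d,) are hypothetical
    gate parameters. Returns the (seq_len, d) array of hidden states.
    """
    seq_len, d = xs.shape
    h = np.zeros(d) if h0 is None else h0
    states = np.empty((seq_len, d))
    for t in range(seq_len):
        a_t = sigmoid(xs[t] @ w_a + b_a)    # gate in (0, 1), computed from the input only
        h = a_t * h + (1.0 - a_t) * xs[t]   # update is linear in the previous state
        states[t] = h
    return states

# Toy usage: zero gate parameters give a_t = 0.5 at every step.
xs = np.ones((3, 2))
states = gated_linear_recurrence(xs, np.zeros((2, 2)), np.zeros(2))
```

The state has a fixed size regardless of how many tokens have been processed, so in this sketch each inference step costs the same no matter how long the sequence is, which is the property behind the fast inference on long sequences mentioned above.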