Large Language Models (LLMs) have become foundational in natural
language processing, demonstrating performance improvements as model sizes
increase. The Mixture-of-Experts (MoE) approach offers a promising way to scale
LLMs more efficiently by using fewer computational FLOPs through sparse
activation. However, it suffers from significant memory overheads,
necessitating model compression techniques. Post-training quantization, a
popular method for model compression, proves less effective when directly
applied to MoE models due to MoE's overlooked inherent sparsity. This paper
explores several MoE structure-aware quantization heuristics, ranging from
coarse to fine granularity, from MoE block to individual linear weight. Our
investigations reveal critical principles: different MoE structures (i.e.,
blocks, experts, linear layers) require varying numbers of weight bits for
effective and efficient quantization. Conclusions are supported by extensive
benchmarking across two representative MoE models and six tasks. We further
introduce novel enhancements to more accurately identify the most critical
weights in MoE quantization that necessitate higher bit allocations, including
the linear weight outlier scorer and MoE block scorer. Additionally, subsequent
experiments validate our findings in the context of both weight and activation
quantization.
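
The idea of giving different MoE structures different weight bit widths can be illustrated with a small sketch. Everything here is an assumption made for illustration: the round-to-nearest quantizer, the `outlier_score` rule (largest magnitude divided by mean magnitude), the 2-bit and 4-bit budget, and the toy layer names are hypothetical and are not taken from the paper's scorers.

```python
# Illustrative sketch only: the scoring rule and bit budget are assumptions,
# not the procedure described in the paper.

def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of a flat list of floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def outlier_score(weights):
    """Hypothetical score: largest magnitude divided by mean magnitude."""
    mags = [abs(w) for w in weights]
    mean = sum(mags) / len(mags)
    return max(mags) / mean if mean > 0 else 0.0


def allocate_bits(layers, low_bits=2, high_bits=4, top_fraction=0.25):
    """Give high_bits to the top-scoring fraction of layers, low_bits to the rest."""
    ranked = sorted(layers, key=lambda name: outlier_score(layers[name]), reverse=True)
    n_high = max(1, int(len(ranked) * top_fraction))
    return {name: (high_bits if i < n_high else low_bits)
            for i, name in enumerate(ranked)}


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "expert" weights: expert1 contains one large outlier.
layers = {
    "expert0.w1": [0.10, -0.12, 0.09, 0.11, -0.10, 0.08, -0.09, 0.10],
    "expert1.w1": [0.10, -0.12, 0.09, 0.90, -0.10, 0.08, -0.09, 0.10],
    "expert2.w1": [0.20, -0.21, 0.19, 0.22, -0.20, 0.18, -0.19, 0.21],
    "expert3.w1": [0.05, -0.06, 0.05, 0.04, -0.05, 0.06, -0.04, 0.05],
}

bit_plan = allocate_bits(layers)
quantized = {name: quantize_rtn(w, bit_plan[name]) for name, w in layers.items()}
```

In this toy setup the layer with the outlier receives the higher bit width, which is the kind of structure-dependent allocation the abstract describes. A real benchmark would score per-block, per-expert, and per-linear-layer statistics on actual model weights.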

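The Mixture-of-Experts (MoE) approach scales LLMs efficiently by using sparse activation to reduce computational FLOPs, but it carries significant memory overhead, and conventional post-training quantization performs poorly when applied directly to MoE models. This work explores MoE structure-aware quantization heuristics along several dimensions, from coarse to fine granularity and from the MoE block down to individual linear weights. The results reveal a key principle: different MoE structures (such as blocks, experts, and linear layers) require different numbers of weight bits for effective and efficient quantization. These conclusions are supported by extensive benchmarking across two representative MoE models and six tasks. We also introduce new enhancements, including a linear weight outlier scorer and an MoE block scorer, to more accurately identify the critical weights in MoE quantization that need higher bit allocations. Subsequent experiments further validate the findings for both weight and activation quantization.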

A Study of Post-Training Quantization for Mixture-of-Experts: A Benchmark Evaluation

Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark

We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called
MicroAdam that specifically minimizes memory overheads, while maintaining
theoretical convergence guarantees. We achieve this by compressing the gradient
information before it is fed into the optimizer state, thereby reducing its
memory footprint significantly. We control the resulting compression error via
a novel instance of the classical error feedback mechanism from distributed
optimization [Seide et al., 2014, Alistarh et al., 2018, Karimireddy et al.,
2019] in which the error correction information is itself compressed to allow
for practical memory gains. We prove that the resulting approach maintains
theoretical convergence guarantees competitive with those of AMSGrad, while
providing good practical performance. Specifically, we show that MicroAdam can
be implemented efficiently on GPUs: on both million-scale (BERT) and
billion-scale (LLaMA) models, MicroAdam provides practical convergence
competitive with that of the uncompressed Adam baseline, with lower memory usage
and similar running time. Our code is available at
this https URL
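
The classical error feedback mechanism mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MicroAdam implementation: it uses Top-K sparsification as the compressor, keeps the error buffer uncompressed (whereas the abstract states that MicroAdam compresses the error correction information itself), and omits the Adam update entirely. The names `top_k` and `ErrorFeedbackCompressor` are hypothetical.

```python
# Illustrative sketch of classical error feedback with Top-K compression.
# Assumption: the error buffer is stored densely; MicroAdam compresses it.

def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackCompressor:
    """Accumulates what compression dropped and re-injects it at the next step."""

    def __init__(self, dim, k):
        self.k = k
        self.error = [0.0] * dim

    def compress(self, grad):
        # Add back the residual left over from earlier steps.
        corrected = [g + e for g, e in zip(grad, self.error)]
        sparse = top_k(corrected, self.k)
        # Whatever was not transmitted becomes the new residual.
        self.error = [c - s for c, s in zip(corrected, sparse)]
        return sparse


compressor = ErrorFeedbackCompressor(dim=4, k=1)
grads = [[1.0, -0.1, 0.05, 0.3] for _ in range(4)]
sent = [compressor.compress(g) for g in grads]
```

With `k=1`, the first three steps pass only the dominant first coordinate. The last coordinate keeps accumulating in the error buffer until, at the fourth step, its accumulated value exceeds the first coordinate and is passed instead. No gradient mass is lost: the compressed outputs plus the remaining error always sum to the raw gradients.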

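The paper proposes MicroAdam, a new variant of the Adam optimizer that specifically minimizes memory overhead while maintaining theoretical convergence guarantees. It significantly reduces the memory footprint by compressing gradient information before it enters the optimizer state. The classical error feedback mechanism from distributed optimization is used to control the compression error while still delivering practical memory gains. The approach is proven to have theoretical convergence guarantees competitive with those of AMSGrad and offers good practical performance. Implemented efficiently on GPUs, MicroAdam achieves convergence competitive with the uncompressed Adam baseline on both million-scale (BERT) and billion-scale (LLaMA) models, with lower memory usage and similar running time.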

MicroAdam: An Accurate Adaptive Optimization Method with Low Space Overhead and Provable Convergence

MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

In the evolving landscape of neural network models, one prominent challenge
stands out: the significant memory overhead associated with training large
models. To address this challenge, this study examines Rotated
Tensor Parallelism (RTP), an approach that
focuses on memory deduplication in distributed training environments. Its
distinguishing features include a customized communication primitive and Flyweight
Pattern initialization. Furthermore, RTP ensures a seamless overlap between
partition computation and partition weight communication, optimizing the
training process. Our empirical evaluations underscore RTP's efficiency,
revealing that its memory consumption during distributed training is
close to optimal: the memory overhead of a single
machine is distributed evenly across multiple machines. The experimental results demonstrate
that RTP is capable of achieving comparable performance to Distributed Data
Parallel while providing support for significantly larger models with
near-linear scalability in terms of memory. The RTP code is available at
this https URL
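
One way to picture the rotation of weight partitions is the single-process simulation below. It is a sketch under assumptions, not the RTP implementation: the ring schedule, the column-wise partitioning, and the names `matvec_shard` and `rotated_forward` are hypothetical. Communication is modeled as an index permutation, and the overlap of computation with communication is not modeled at all.

```python
# Illustrative single-process simulation of rotating weight shards on a ring.
# Assumption: each worker keeps its own input and holds one weight shard at a time.

def matvec_shard(x, shard):
    """Multiply input vector x by a shard given as a list of output columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in shard]


def rotated_forward(inputs, shards):
    """Compute every worker's full output while each holds one shard per step.

    inputs[r] is worker r's private input vector.
    shards[s] is weight partition s (a list of columns).
    Worker r starts with shard r; after each step shards move one hop on the ring.
    """
    n = len(shards)
    held = list(range(n))                      # held[r] = shard index on worker r
    outputs = [[None] * n for _ in range(n)]   # outputs[r][s] = slice s of r's output
    for _ in range(n):
        for r in range(n):
            s = held[r]
            outputs[r][s] = matvec_shard(inputs[r], shards[s])
        # Worker r receives the shard previously held by worker r - 1.
        held = [held[(r - 1) % n] for r in range(n)]
    # Concatenate the slices in shard order to form each worker's full output.
    return [[v for part in parts for v in part] for parts in outputs]


shards = [
    [[1, 0], [0, 1]],    # shard 0: two output columns
    [[2, 1]],            # shard 1: one output column
    [[1, 1], [3, -1]],   # shard 2: two output columns
]
inputs = [[1, 2], [3, 4], [5, 6]]
result = rotated_forward(inputs, shards)
```

After as many steps as there are workers, every worker has seen every shard and holds the same output it would have computed with the full weight matrix, while never storing more than one shard at a time in this model.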

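This study examines Rotated Tensor Parallelism (RTP), an approach that applies memory deduplication to address the significant memory overhead of training large-scale models, and that optimizes the training process. Empirical evaluations show that RTP's memory consumption during distributed training is very close to optimal, and that it achieves performance comparable to Distributed Data Parallel while supporting significantly larger models.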