Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore (Gradient Low-Rank Projection) addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works extend GaLore in several directions, including low-bit quantization and higher-order tensor structures. However, several challenges remain for GaLore, such as the computational overhead of SVD for subspace updates and integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
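
As a rough illustration of the gradient low-rank projection idea described above, the following NumPy sketch projects a gradient matrix onto its top-r left singular subspace and maps it back. It is a minimal sketch under stated assumptions, not the GaLore 2 implementation: the function names (`top_r_projector`, `project`, `project_back`), the matrix sizes, and the synthetic exactly-low-rank gradient are all hypothetical choices made for illustration, and a plain full SVD is used here for simplicity.

```python
import numpy as np


def top_r_projector(grad, rank):
    """Return the top-`rank` left singular vectors of `grad` (shape m x rank)."""
    # Thin SVD; singular vectors come back sorted by descending singular value.
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def project(grad, projector):
    """Map an m x n gradient into the rank x n subspace."""
    return projector.T @ grad


def project_back(low_rank_update, projector):
    """Map a rank x n update back to the full m x n weight shape."""
    return projector @ low_rank_update


# Hypothetical sizes, chosen only for illustration.
m, n, rank = 64, 32, 4
rng = np.random.default_rng(0)

# Synthetic gradient that is exactly rank 4, so the projection loses nothing.
grad = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

P = top_r_projector(grad, rank)
low = project(grad, P)            # shape (rank, n): what optimizer state would track
restored = project_back(low, P)   # shape (m, n): update applied to the weights
```

In this sketch, any optimizer state kept for `low` would have `rank * n` entries instead of `m * n`, which is where the memory saving comes from. Recomputing `P` requires an SVD each time the subspace is refreshed, which is the overhead the abstract refers to.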

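This work addresses the significant memory bottleneck that large language models face during training. Using gradient low-rank projection, GaLore 2 provides an efficient and scalable framework that overcomes the computational overhead of SVD and the challenge of integrating with state-of-the-art training parallelization strategies. The study shows that GaLore 2 can pre-train Llama 7B from scratch on up to 500 billion training tokens, demonstrating its potential impact on real-world LLM pre-training scenarios.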

GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection