Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.

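This paper explores low-rank training techniques as an alternative approach to training large neural networks. It introduces a new method called ReLoRA, applies it to pre-training transformer language models with up to 350M parameters, and demonstrates performance comparable to regular neural network training. We also find that the efficiency of ReLoRA increases with model size, which makes it a promising approach for efficiently training multi-billion-parameter networks. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.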
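The idea of reaching a high-rank result through low-rank updates rests on a basic linear-algebra fact: a sum of several rank-r matrices can have rank well above r. The NumPy sketch below illustrates only that fact. It is not the ReLoRA training procedure, and the function name, matrix sizes, and random factors are assumptions chosen for illustration.

```python
import numpy as np


def accumulate_low_rank_updates(dim=64, rank=4, num_updates=8, seed=0):
    """Sum several random rank-`rank` updates and track the rank of the total.

    Hypothetical helper for illustration only: the updates are random
    matrices, not gradients from any training run.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros((dim, dim))
    ranks = []
    for _ in range(num_updates):
        # Each update is a product of two thin factors, so its rank is at most `rank`.
        B = rng.standard_normal((dim, rank))
        A = rng.standard_normal((rank, dim))
        total += B @ A
        # Numerical rank of the accumulated sum after this update.
        ranks.append(int(np.linalg.matrix_rank(total)))
    return ranks


ranks = accumulate_low_rank_updates()
print(ranks)  # the rank of the sum grows as more low-rank updates are added
```

Each individual update here has rank at most 4, yet the accumulated matrix can reach a rank of up to 4 times the number of updates (capped by the matrix dimension), which is the sense in which repeated low-rank updates can produce a high-rank change.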

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates