Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format with potential to relax resource requirements. Unlike prior works which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond only as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better finetuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning with notable memory and compute footprint reduction. For example, finetuning a 50\% compressed LLaMa-2 7B model using only a fraction of parameters in LRCs (WeLore) can outperform its full finetuning with ~3x better throughput and ~0.6x GPU requirement. Our codes are available at \url{https://github.com/VITA-Group/welore}

现代大型语言模型（LLMs）由数十亿个元素组成的矩阵，其存储和处理对计算资源和内存使用非常苛刻，本文研究了在不同层的LLMs内矩阵低秩结构的产生和梯度动态之间的相关性，提出了一种统一的权重低秩投影（WeLore）方法，将权重压缩和内存高效微调融为一体，通过利用奇异值的重尾分布来确定适当的秩降缩放比例，能够显著减少内存和计算资源占用，且低秩组件（LRCs）具有更好的微调能力并能够在性能上接近或超过完全微调的训练损失轨迹和性能。

从GaLore到WeLore：稀疏梯度中低秩权重的非均匀出现