Matrices are exceptionally useful in various fields of study as they provide
a convenient framework to organize and manipulate data in a structured manner.
However, modern matrices can involve billions of elements, making their storage
and processing quite demanding in terms of computational resources and memory
usage. Although prohibitively large, such matrices are often approximately low
rank. We propose an algorithm that exploits this structure to obtain a low rank
decomposition of any matrix $\mathbf{A}$ as $\mathbf{A} \approx
\mathbf{L}\mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are the low rank
factors. The total number of elements in $\mathbf{L}$ and $\mathbf{R}$ can be
significantly less than that in $\mathbf{A}$. Furthermore, the entries of
$\mathbf{L}$ and $\mathbf{R}$ are quantized to low precision formats,
compressing $\mathbf{A}$ into a low rank and low precision
factorization. Our algorithm first computes an approximate basis of the range
space of $\mathbf{A}$ by randomly sketching its columns, followed by a
quantization of the vectors constituting this basis. It then computes
approximate projections of the columns of $\mathbf{A}$ onto this quantized
basis. We derive upper bounds on the approximation error of our algorithm, and
analyze the impact of target rank and quantization bit-budget. The tradeoff
between compression ratio and approximation accuracy allows for flexibility in
choosing these parameters based on specific application requirements. We
empirically demonstrate the efficacy of our algorithm in image compression,
nearest neighbor classification of image and text embeddings, and compressing
the layers of LlaMa-$7$b. Our results illustrate that we can achieve
compression ratios as aggressive as one bit per matrix coordinate, all while
surpassing or maintaining the performance of traditional compression
techniques.
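
The three steps described above (sketch the columns, quantize the basis, project onto the quantized basis) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian sketching matrix, the min/max uniform scalar quantizer, the least-squares projection, and the names `uniform_quantize`, `low_rank_low_precision`, and `bits_per_coordinate` are all assumptions made for illustration.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Round x onto 2**bits evenly spaced levels spanning [x.min(), x.max()].

    Assumption: a simple uniform scalar quantizer stands in for whatever
    low precision format is actually used.
    """
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:  # constant input: nothing to quantize
        return x.astype(float).copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / step) * step

def low_rank_low_precision(A, rank, bits, seed=0):
    """Return quantized factors L (n x rank) and R (rank x d) with A ~ L @ R."""
    rng = np.random.default_rng(seed)
    # Step 1: sketch the columns of A with a random Gaussian matrix (assumed).
    S = rng.standard_normal((A.shape[1], rank))
    # Step 2: quantize the sketched vectors, giving the quantized basis L.
    L = uniform_quantize(A @ S, bits)
    # Step 3: approximately project the columns of A onto the quantized basis
    # by least squares, then quantize the coefficients as well.
    W, *_ = np.linalg.lstsq(L, A, rcond=None)
    R = uniform_quantize(W, bits)
    return L, R

def bits_per_coordinate(shape, rank, bits):
    """Storage for both factors, in bits per entry of the original matrix."""
    n, d = shape
    return bits * rank * (n + d) / (n * d)
```

Lowering `rank` or `bits` reduces `bits_per_coordinate` at the cost of a larger `np.linalg.norm(A - L @ R)`, which is the tradeoff the abstract refers to.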

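We propose an algorithm that exploits the low rank structure of matrices to obtain a low rank decomposition of any matrix, trading off compression ratio against approximation accuracy through quantization of the basis vectors.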

Matrix Compression via Randomized Low Rank and Low Precision Factorization

Low Rank Decomposition of a matrix, that is, splitting a large matrix into a
product of two smaller matrices, offers a means of compression that reduces the
parameters of a model without sparsification, and hence delivers more speedup
on modern hardware. Moreover, unlike quantization, the compressed linear layers
remain fully differentiable and all the parameters trainable, while being able
to leverage the existing highly efficient kernels over floating point matrices.
We study the potential to compress Large Language Models (LLMs) for monolingual
code generation via Low Rank Decomposition (LoRD) and observe that ranks for
the linear layers in these models can be reduced by up to 39.58% with less than
1% increase in perplexity. We then use LoRD to compress StarCoder 16B to 13.2B
parameters with no drop and to 12.3B with minimal drop in HumanEval Pass@1
score, in less than 10 minutes on a single A100. The compressed models speed up
inference by up to 22.35% with just a single line of change in code over
Hugging Face's implementation with the PyTorch backend. LoRD models remain
compatible with state-of-the-art near-lossless quantization methods such as
SpQR, which allows leveraging the further compression gains of quantization.
Lastly, QLoRA over a LoRD model further reduces memory requirements by as much
as 21.2% over vanilla QLoRA while offering similar gains from parameter
efficient fine tuning. Our work shows LoRD as a promising new paradigm for LLM
compression.
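
To make "splitting a large matrix into a product of two smaller matrices" concrete, here is a minimal NumPy sketch of decomposing one linear layer's weight. Plain truncated SVD is an assumption chosen for illustration; the abstract does not state which decomposition is used, and the names `decompose_linear` and `param_count` are hypothetical.

```python
import numpy as np

def decompose_linear(W, rank):
    """Split W (d_out x d_in) into B (d_out x rank) and A (rank x d_in).

    Assumption: truncated SVD, which gives the best rank-`rank`
    approximation of W in the Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * s[:rank]  # fold the singular values into the left factor
    A = Vt[:rank, :]
    return B, A

def param_count(d_out, d_in, rank=None):
    """Parameters of a dense layer, or of its two-factor replacement."""
    return d_out * d_in if rank is None else rank * (d_out + d_in)
```

A forward pass `x @ W.T` then becomes `(x @ A.T) @ B.T`: two ordinary dense floating point matmuls, so both factors stay differentiable and can use existing kernels.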

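Compressing large language models (LLMs) for monolingual code generation with Low Rank Decomposition (LoRD) substantially reduces parameters and speeds up inference while keeping the layers differentiable and trainable and compatible with existing efficient floating point matrix kernels, showing promise for improving model compression.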

LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression

Compression of a neural network can help in speeding up both the training and
the inference of the network. In this research, we study applying compression
using low rank decomposition on network layers. Our research demonstrates that
to acquire a speedup, the compression methodology should be aware of the
underlying hardware, and analysis should be done to choose which layers to
compress. The advantage of our approach is demonstrated via a case study of
compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two
different hardware systems, Nvidia V100 and Huawei Ascend910. With hardware
targeted compression, results showed a 5.36% training speedup on Ascend910 and
a 15.79% inference speedup on Ascend310 with only a 1% drop in accuracy
compared to the original uncompressed model.
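
One way to picture the per-layer analysis is a selection step driven by latencies measured on the target device. The sketch below is purely illustrative and is not the procedure from the paper: the `LayerTiming` record, the `select_layers` function, the `min_speedup` threshold, and the layer names in the usage note are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LayerTiming:
    name: str
    dense_ms: float       # measured latency of the original layer on the target device
    decomposed_ms: float  # measured latency of its low rank replacement on the same device

def select_layers(timings, min_speedup=1.05):
    """Return the names of layers whose decomposed form is measurably faster.

    A layer is kept dense when decomposition does not pay off on this
    hardware, even if it would reduce the parameter count.
    """
    return [t.name for t in timings
            if t.dense_ms / t.decomposed_ms >= min_speedup]
```

For example, with made-up measurements `LayerTiming("layer3.conv2", 1.00, 0.80)` and `LayerTiming("layer1.conv1", 0.50, 0.55)`, only the first layer would be selected, since the second becomes slower after decomposition.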

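This study investigates compressing neural networks by applying low rank decomposition to network layers in order to speed up training and inference. Our research shows that, to obtain a speedup, the compression method should take the underlying hardware into account and analyze which layers to compress. We demonstrate the advantage of our approach through a case study of compressing ResNet50 and training on the full ImageNet-ILSVRC2012 dataset. We tested on two different hardware systems, Nvidia V100 and Huawei Ascend910. With hardware targeted compression, training on Ascend910 was 5.36% faster and inference on Ascend310 was 15.79% faster, with only a 1% drop in accuracy compared to the original uncompressed model.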

Speeding up Resnet Architecture with Layers Targeted Low Rank Decomposition

Low Rank Decomposition (LRD) is a model compression technique applied to the
weight tensors of deep learning models in order to reduce the number of
trainable parameters and computational complexity. However, due to the high
number of new layers added to the architecture after applying LRD, it may not
lead to a high training/inference acceleration if the decomposition ranks are
not small enough. The issue is that using small ranks increases the risk of a
significant accuracy drop after decomposition. In this paper, we propose two
techniques for accelerating low rank decomposed models without requiring small
ranks for decomposition: rank optimization and sequential freezing of
decomposed layers. We perform experiments on both convolutional and
transformer-based models. Experiments show that, when combined, these
techniques can improve the model throughput by up to 60% during training and
37% during inference while preserving accuracy close to that of the original
models.
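
As a worked illustration of why the ranks must be "small enough", consider the simplest case of a fully connected layer. The symbols here are assumptions introduced for illustration: a weight $W \in \mathbb{R}^{m \times n}$ is replaced by $W \approx UV$ with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$. Counting parameters (equivalently, multiply-adds per input vector) gives a break-even rank:

```latex
\begin{align*}
\text{dense cost} &= mn, &
\text{factored cost} &= r(m+n), \\
r(m+n) < mn &\iff r < \frac{mn}{m+n}, &
m = n = 1024 &\implies r < 512.
\end{align*}
```

This count covers arithmetic only. The factored layer also runs as two layers instead of one, which adds overhead, so the rank at which a real speedup appears can be lower than this bound.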

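By optimizing ranks and sequentially freezing decomposed layers, the two techniques proposed in this paper improve training and inference throughput by up to 60% and 37%, respectively, while keeping accuracy close to that of the original models.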