BriefGPT.xyz
Jun, 2024
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald...
TL;DR
Serving LoRAs after compressing them into a shared basis plus LoRA-specific scaling matrices improves throughput while preserving 75% of performance.
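The compression the TL;DR describes can be sketched with plain linear algebra: stack the adapters' weight updates, extract a shared low-rank basis, and keep only a small per-adapter scaling matrix. The sketch below is a toy illustration under assumed names and sizes (dimensions, rank `k`, and the SVD-based basis choice are assumptions), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 64, 4, 10  # hidden dim, LoRA rank, number of adapters (toy sizes)

# Toy LoRA updates: adapter i contributes Delta_i = B_i @ A_i (d x d, rank r).
deltas = [rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) for _ in range(n)]

# Shared bases (assumption: chosen via SVD of the stacked updates), so that
# Delta_i is approximated as U @ Sigma_i @ V.T with a small k x k Sigma_i.
k = 16  # shared basis rank (assumption; tune per workload)
U, _, _ = np.linalg.svd(np.hstack(deltas), full_matrices=False)
U = U[:, :k]                                  # shared column basis (d x k)
_, _, Vt = np.linalg.svd(np.vstack(deltas), full_matrices=False)
V = Vt[:k].T                                  # shared row basis (d x k)

# Per-adapter scaling matrices: only these k x k blocks are adapter-specific.
sigmas = [U.T @ D @ V for D in deltas]

# Relative reconstruction error for one adapter.
approx = U @ sigmas[0] @ V.T
err = np.linalg.norm(deltas[0] - approx) / np.linalg.norm(deltas[0])
```

Serving then needs only the two shared bases in GPU memory plus one `k × k` matrix per adapter, instead of a full `B_i, A_i` pair each, which is where the throughput gain in the TL;DR comes from.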
Abstract
Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the …