Parameter-efficient finetuning (PEFT) is a widely used technique to adapt
large language models for different tasks. Service providers typically create
separate systems for users to perform PEFT model finetuning and inference
tasks. This is because existing systems cannot handle workloads that include a
mix of inference and PEFT finetuning requests. As a result, shared GPU
resources are underutilized, leading to inefficiencies. To address this
problem, we present FlexLLM, the first system that can serve inference and
parameter-efficient finetuning requests in the same iteration. Our system
leverages the complementary nature of these two tasks and utilizes shared GPU
resources to run them jointly, using a method called co-serving. To achieve
this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks
down the finetuning computation of a sequence into smaller token-level
computations and uses dependent parallelization and graph pruning, two static
compilation optimizations, to minimize the memory overhead and latency for
co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces
the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory
requirement of finetuning by up to 36% while maintaining a low inference
latency and improving finetuning throughput. For example, under a heavy
inference workload, FlexLLM can still preserve more than 80% of the peak
finetuning throughput, whereas existing systems cannot make any progress with
finetuning. The source code of FlexLLM is publicly available at
this https URL
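
The co-serving idea can be illustrated with a toy scheduler. This is a minimal sketch, not FlexLLM's actual scheduler: it assumes a fixed per-iteration token budget, gives inference tokens priority, and fills whatever capacity is left with finetuning tokens, which is only possible because finetuning work is split at token granularity. The names `plan_iteration`, `co_serve`, and `token_budget` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def plan_iteration(inference_tokens: int, finetune_tokens_left: int,
                   token_budget: int) -> Tuple[int, int]:
    """Split one iteration's token budget between inference and finetuning.

    Inference is served first (assumption: it is latency-critical); any
    spare capacity is filled with finetuning tokens.
    """
    served = min(inference_tokens, token_budget)
    spare = token_budget - served
    finetune = min(spare, finetune_tokens_left)
    return served, finetune


def co_serve(inference_load: List[int], finetune_sequence_len: int,
             token_budget: int) -> Tuple[List[Tuple[int, int]], int]:
    """Plan several iterations; return the per-iteration split and the
    number of finetuning tokens still unprocessed at the end."""
    remaining = finetune_sequence_len
    plan = []
    for load in inference_load:
        inference, finetune = plan_iteration(load, remaining, token_budget)
        remaining -= finetune
        plan.append((inference, finetune))
    return plan, remaining


if __name__ == "__main__":
    plan, remaining = co_serve([6, 8, 3, 0], finetune_sequence_len=10,
                               token_budget=8)
    for step, (inference, finetune) in enumerate(plan):
        print(f"iteration {step}: inference={inference} finetune={finetune}")
    print("finetuning tokens left:", remaining)
```

A sequence-level scheduler would have to wait for an iteration with room for the whole finetuning sequence. The token-level split lets a 10-token sequence make progress across several partly loaded iterations.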

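We present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Through co-serving, it uses shared GPU resources to run both tasks jointly. FlexLLM's co-serving approach reduces activation GPU memory overhead by up to 8x and the end-to-end GPU memory requirement of finetuning by up to 36%, while maintaining low inference latency and improving finetuning throughput.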

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

The growing demand for Large Language Models (LLMs) in applications such as
content generation, intelligent chatbots, and sentiment analysis poses
considerable challenges for LLM service providers. To efficiently use GPU
resources and boost throughput, batching multiple requests has emerged as a
popular paradigm; to further speed up batching, LLM quantization techniques
reduce memory consumption and increase computing capacity. However, prevalent
quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully
leverage the capabilities of modern GPUs, such as 4-bit integer operators,
resulting in sub-optimal performance.
To maximize LLMs' serving throughput, we introduce Atom, a low-bit
quantization method that achieves high throughput improvements with negligible
accuracy loss. Atom significantly boosts serving throughput by using low-bit
operators and considerably reduces memory consumption via low-bit quantization.
It attains high accuracy by applying a novel mixed-precision and fine-grained
quantization process. We evaluate Atom on 4-bit weight-activation quantization
setups in the serving context. Atom improves end-to-end throughput by up to
$7.73\times$ compared to FP16 and by $2.53\times$ compared to INT8
quantization, while maintaining the same latency target.
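
To make "fine-grained quantization" concrete, the sketch below applies symmetric 4-bit quantization with one scale per group of values. It is an illustrative assumption about what group-wise INT4 quantization looks like in general, not Atom's kernels: it omits the mixed-precision handling of outliers and any activation-specific logic, and the function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (4-bit integer codes, per-group scale)


def quantize_group_int4(values: List[float]) -> Group:
    """Symmetric 4-bit quantization of one group: codes lie in [-8, 7]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def quantize_groupwise(values: List[float], group_size: int) -> List[Group]:
    """Quantize consecutive groups of `group_size` values, each with its own scale."""
    return [quantize_group_int4(values[start:start + group_size])
            for start in range(0, len(values), group_size)]


def dequantize_groupwise(groups: List[Group]) -> List[float]:
    """Map integer codes back to approximate real values."""
    out: List[float] = []
    for codes, scale in groups:
        out.extend(code * scale for code in codes)
    return out


if __name__ == "__main__":
    # Small values next to large ones: a single shared scale erases the small ones.
    values = [0.1, -0.2, 0.05, 0.7, 10.0, -3.0, 2.0, 8.0]
    for group_size in (8, 4):
        approx = dequantize_groupwise(quantize_groupwise(values, group_size))
        error = sum(abs(a - b) for a, b in zip(values, approx))
        print(f"group_size={group_size}: total absolute error = {error:.3f}")
```

Smaller groups cost one extra scale per group but keep small-magnitude values from being rounded to zero by a scale chosen for the largest value in the tensor.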

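Atom is a low-bit quantization method that significantly boosts serving throughput and reduces memory consumption by using low-bit operators and low-bit quantization, while maintaining the same latency target.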

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

This paper aims to explore the potential of combining Deep Reinforcement
Learning (DRL) with Knowledge Distillation (KD) by distilling various DRL
algorithms and studying their distillation effects. By doing so, the
computational burden of deep models could be reduced while maintaining the
performance. The primary objective is to provide a benchmark for evaluating the
performance of different DRL algorithms that have been refined using KD
techniques. By distilling these algorithms, the goal is to develop efficient
and fast DRL models. This research is expected to provide valuable insights
that can facilitate further advancements in this promising direction. By
exploring the combination of DRL and KD, this work aims to promote the
development of models that require fewer GPU resources, learn more quickly, and
make faster decisions in complex environments. The results of this research
have the capacity to significantly advance the field of DRL and pave the way
for the future deployment of resource-efficient, decision-making intelligent
systems.
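
The abstract does not state which distillation objective is used. As a hedged illustration, the sketch below shows one common formulation for distilling a policy: the KL divergence between the teacher's and the student's temperature-softened action distributions. The function names and the default temperature are assumptions made for this example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over action logits, softened by a temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over softened action distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]        # logits from a large, trained policy
    good_student = [1.9, 1.1, 0.0]   # small policy that roughly agrees
    bad_student = [0.1, 1.0, 2.0]    # small policy that prefers another action
    print("loss (close student):", distillation_loss(teacher, good_student))
    print("loss (far student):  ", distillation_loss(teacher, bad_student))
```

Minimizing this loss pushes the smaller student network toward the teacher's action preferences, which is how distillation can lower inference cost while aiming to retain performance.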

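By exploring the potential of combining Deep Reinforcement Learning (DRL) with Knowledge Distillation (KD), this paper distills various DRL algorithms and studies their distillation effects, aiming to reduce the computational burden of deep models while maintaining performance. The primary objective is to provide a benchmark for evaluating different DRL algorithms refined with KD techniques. Distilling these algorithms is intended to yield efficient and fast DRL models. The research is expected to provide insights that facilitate further progress in this promising direction. By exploring the combination of DRL and KD, the work aims to promote models that require fewer GPU resources, learn more quickly, and make faster decisions in complex environments. The results could significantly advance the field of DRL and pave the way for deploying resource-efficient, decision-making intelligent systems.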

Leveraging Knowledge Distillation for Efficient Deep Reinforcement Learning in Resource-Constrained Environments

Large Language Models (LLMs) have revolutionized Natural Language Processing
(NLP) but demand massive GPU resources for training. Lowering the threshold for
LLM training would encourage greater participation from researchers,
benefiting both academia and society. While existing approaches have focused on
parameter-efficient fine-tuning, which tunes or adds a small number of
parameters, few have addressed the challenge of tuning the full parameters of
LLMs with limited resources. In this work, we propose a new optimizer,
LOw-Memory Optimization (LOMO), which fuses the gradient computation and the
parameter update in one step to reduce memory usage. By integrating LOMO with
existing memory-saving techniques, we reduce memory usage to 10.8% of that of
the standard approach (DeepSpeed solution). Consequently, our approach enables
the full parameter fine-tuning of a 65B model on a single machine with 8 RTX
3090 GPUs, each with 24GB memory.
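
The fusion of gradient computation and parameter update can be sketched on a toy model. The example assumes a scalar chain y = w_n * ... * w_1 * x with a squared-error loss and plain SGD. It is not the LOMO implementation, and the function names are hypothetical. The fused variant updates each weight as soon as its gradient is known during the backward pass, so it holds only one gradient at a time instead of a gradient for every parameter.

```python
from typing import List, Tuple


def forward(weights: List[float], x: float) -> Tuple[float, List[float]]:
    """Run the chain and save the input seen by each layer for the backward pass."""
    inputs = []
    hidden = x
    for w in weights:
        inputs.append(hidden)
        hidden = w * hidden
    return hidden, inputs


def standard_sgd_step(weights: List[float], x: float, target: float,
                      lr: float) -> List[float]:
    """Baseline: materialize every gradient first, then update all weights."""
    y, inputs = forward(weights, x)
    upstream = y - target  # d(0.5 * (y - target)^2) / dy
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = upstream * inputs[i]
        upstream = upstream * weights[i]
    return [w - lr * g for w, g in zip(weights, grads)]


def fused_sgd_step(weights: List[float], x: float, target: float,
                   lr: float) -> List[float]:
    """Fused: update each weight inside the backward pass, keeping one gradient alive."""
    weights = list(weights)
    y, inputs = forward(weights, x)
    upstream = y - target
    for i in reversed(range(len(weights))):
        grad = upstream * inputs[i]
        upstream = upstream * weights[i]  # propagate with the pre-update weight
        weights[i] -= lr * grad           # update immediately; grad is then discarded
    return weights


if __name__ == "__main__":
    start = [0.5, -1.5, 2.0]
    print("standard:", standard_sgd_step(start, x=1.0, target=0.3, lr=0.01))
    print("fused:   ", fused_sgd_step(start, x=1.0, target=0.3, lr=0.01))
```

Both functions produce the same updated weights. The difference is peak memory: the baseline stores a gradient for every parameter, while the fused loop needs storage for only the gradient currently being applied.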

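This work proposes a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and the parameter update into one step. Combined with existing memory-saving techniques, it reduces memory usage during full-parameter fine-tuning of large language models (LLMs) and enables full-parameter fine-tuning of a 65B-parameter model on a single machine with 8 RTX 3090 GPUs.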