Optimizing the deployment of large language models (LLMs) in edge computing
environments is critical for enhancing privacy and computational efficiency.
Toward efficient wireless LLM inference in edge computing, this study
comprehensively analyzes the impact of different splitting points in mainstream
open-source LLMs. Building on this analysis, it introduces a framework inspired
by model-based reinforcement learning (MBRL) to determine the
optimal splitting point between the edge and user equipment (UE). By
incorporating a reward surrogate model, our approach significantly reduces the
computational cost of frequent performance evaluations. Extensive simulations
demonstrate that this method effectively balances inference performance and
computational load under varying network conditions, providing a robust
solution for LLM deployment in decentralized settings.

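Using a model-based reinforcement learning approach, this study optimizes the deployment of large language models in edge computing environments, improving privacy and computational efficiency, reducing computational cost, and balancing inference performance against computational load in decentralized settings.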

Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach
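
To make the role of a reward surrogate concrete, the following is a minimal sketch and not the paper's method. It assumes a hypothetical toy reward (`toy_reward`) in place of a real end-to-end evaluation, and it swaps in a plain piecewise-linear interpolator for the learned surrogate. All function names, the reward shape, and the probe points are illustrative assumptions. The idea shown is only this: evaluate the expensive reward at a few split points, then rank every candidate split with the cheap surrogate.

```python
from typing import Callable, Dict, Iterable


def toy_reward(split: int, num_layers: int = 32, loss_rate: float = 0.1) -> float:
    """Hypothetical stand-in for an expensive end-to-end evaluation.

    `split` is the number of layers kept on the UE. The reward penalizes
    UE load and, as a toy assumption, penalizes early splits more heavily
    under a lossy channel.
    """
    ue_load = split / num_layers
    channel_penalty = loss_rate * num_layers / split
    return -(ue_load + channel_penalty)


def build_surrogate(samples: Dict[int, float]) -> Callable[[int], float]:
    """Return a piecewise-linear predictor fitted to the measured rewards."""
    knots = sorted(samples)

    def predict(split: int) -> float:
        # Clamp outside the probed range.
        if split <= knots[0]:
            return samples[knots[0]]
        if split >= knots[-1]:
            return samples[knots[-1]]
        # Interpolate linearly between the two neighbouring probes.
        for lo, hi in zip(knots, knots[1:]):
            if lo <= split <= hi:
                t = (split - lo) / (hi - lo)
                return (1 - t) * samples[lo] + t * samples[hi]
        raise AssertionError("unreachable: split lies between the outer knots")

    return predict


def choose_split(
    reward_fn: Callable[[int], float],
    num_layers: int,
    probe_points: Iterable[int],
) -> int:
    """Probe a few split points, then pick the best split via the surrogate."""
    samples = {s: reward_fn(s) for s in probe_points}  # expensive calls
    surrogate = build_surrogate(samples)               # cheap from here on
    return max(range(1, num_layers), key=surrogate)
```

With the toy reward and probes at 1, 8, 16, 24, and 31, `choose_split(toy_reward, 32, [1, 8, 16, 24, 31])` returns 8 after five evaluations, whereas exhaustively evaluating all 31 candidates picks 10. The gap illustrates the trade-off a surrogate makes between evaluation cost and accuracy.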

Large language models (LLMs) are widely employed for tasks such as
intelligent assistants, text summarization, translation, and multi-modality on
mobile phones. However, current methods for on-device LLM deployment
suffer from slow inference speed, which causes poor user experience. To facilitate
high-efficiency LLM deployment on device GPUs, we propose four optimization
techniques: (a) a symbolic expression-based approach to support dynamic shape
model inference; (b) operator optimizations and execution priority setting to
enhance inference speed and reduce phone lag; (c) an FP4 quantization
method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based
technique to eliminate the need for copying KV cache after LLM inference.
Furthermore, we implement these methods in our mobile inference engine,
Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We
evaluated Transformer-Lite's performance using LLMs with varied architectures
and parameter counts ranging from 2B to 14B. Specifically, we achieved prefill and
decoding speeds of 121 tokens/s and 14 tokens/s for ChatGLM2 6B, and 330 tokens/s
and 30 tokens/s for the smaller Gemma 2B, respectively. Compared with CPU-based
FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup in
prefill speed and a 2-3x speedup in decoding speed.

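To deploy large language models efficiently on mobile devices, we propose four optimization techniques: symbolic expression-based dynamic-shape model inference, operator optimizations and execution priority setting, an FP4 quantization method that reduces dequantization overhead, and a sub-tensor-based technique that eliminates the need to copy the KV cache after LLM inference. We implement these methods in the mobile inference engine Transformer-Lite. Compared with other CPU- and GPU-based engines, our engine achieves over 10x speedup in prefill speed and a 2-3x speedup in decoding speed.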

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
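
The general idea behind avoiding KV cache copies can be illustrated on the CPU. This is a minimal sketch under stated assumptions, not Transformer-Lite's GPU implementation: the `KVCache` class, its method names, and the flat one-head layout are hypothetical. The buffer is allocated once for the maximum sequence length, each decoding step writes into the next slot in place, and readers receive a zero-copy view (a sub-tensor) of the filled region.

```python
from array import array


class KVCache:
    """Preallocated key (or value) cache for a single attention head.

    Hypothetical illustration: storage is sized once for `max_tokens`,
    so appending a token never reallocates or copies earlier entries.
    """

    def __init__(self, max_tokens: int, head_dim: int) -> None:
        self.max_tokens = max_tokens
        self.head_dim = head_dim
        self.length = 0  # number of tokens currently stored
        self._storage = array("d", [0.0]) * (max_tokens * head_dim)
        self._buffer = memoryview(self._storage)

    def append(self, vector: list) -> None:
        """Write one token's vector into the next free slot, in place."""
        if len(vector) != self.head_dim:
            raise ValueError("vector has the wrong dimension")
        if self.length >= self.max_tokens:
            raise ValueError("cache is full")
        start = self.length * self.head_dim
        self._buffer[start:start + self.head_dim] = array("d", vector)
        self.length += 1

    def valid(self) -> memoryview:
        """Return a zero-copy view of the tokens written so far."""
        return self._buffer[: self.length * self.head_dim]
```

Slicing a `memoryview` shares the underlying storage, so `valid()` exposes the growing prefix of the cache without duplicating it. A real engine would apply the same principle to device tensors rather than a Python array.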

Despite the impressive performance of LLMs, their widespread adoption faces
challenges due to substantial computational and memory requirements during
inference. Recent advancements in model compression and system-level
optimization methods aim to enhance LLM inference. This survey offers an
overview of these methods, emphasizing recent developments. Through experiments
on LLaMA(/2)-7B, we evaluate various compression techniques, providing
practical insights for efficient LLM deployment in a unified setting. The
empirical analysis on LLaMA(/2)-7B highlights the effectiveness of these
methods. Drawing from survey insights, we identify current limitations and
discuss potential future directions to improve LLM inference efficiency. We
release the codebase to reproduce the results presented in this paper at
this https URL

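This survey reviews compression methods and system-level optimization methods for LLMs, presents experimental evaluation results and directions for improvement, and offers practical insights for efficient LLM deployment.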