Large Language Models (LLMs) have gained significant popularity and are
extensively used across various domains. Most LLMs are deployed in cloud
data centers, where requests encounter substantial response delays and incur
high costs, degrading the Quality of Service (QoS) at the network edge.
Using a vector database to cache LLM request results at the edge can
substantially reduce the response delay and cost of similar requests, an
opportunity that previous research has overlooked. To address this gap, this
paper introduces a novel Vector database-assisted cloud-Edge collaborative
LLM QoS Optimization (VELO) framework. First, we propose the
VELO framework, which employs a vector database to cache the results
of some LLM requests at the edge to reduce the response time of subsequent
similar requests. Unlike approaches that optimize the LLM directly, the VELO
framework does not require altering the internal structure of the LLM and is
broadly applicable to diverse LLMs. Next, building upon the VELO
framework, we formulate the QoS optimization problem as a Markov Decision
Process (MDP) and devise an algorithm grounded in Multi-Agent Reinforcement
Learning (MARL) to decide whether to request the LLM in the cloud or directly
return the results from the vector database at the edge. Moreover, to enhance
request feature extraction and expedite training, we refine the policy network
of MARL and integrate expert demonstrations. Finally, we implement the proposed
algorithm on a real edge system. Experimental results confirm that the
VELO framework substantially improves user satisfaction by simultaneously
reducing delay and resource consumption for edge users of LLMs.
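
The edge-caching idea described above can be illustrated with a minimal sketch. This is not the paper's implementation. It makes several assumptions: the names `EdgeVectorCache`, `handle_request`, and `cloud_llm` are hypothetical, the bag-of-words `embed` function is a toy stand-in for a real sentence encoder, the in-memory list stands in for a real vector database, and the fixed similarity threshold stands in for the learned MARL policy that decides between the cloud LLM and the edge cache.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): bag-of-words counts, not a real encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EdgeVectorCache:
    """Hypothetical in-memory stand-in for a vector database at the edge."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed fixed rule, not the MARL policy
        self.entries = []           # list of (embedding, cached response)

    def lookup(self, query):
        """Return (best similarity, cached response or None)."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_score, best_response

    def store(self, query, response):
        self.entries.append((embed(query), response))


def handle_request(query, cache, cloud_llm):
    """Answer from the edge cache if a similar request exists, else ask the cloud."""
    score, cached = cache.lookup(query)
    if cached is not None and score >= cache.threshold:
        return cached, "edge"
    response = cloud_llm(query)   # slow, costly path
    cache.store(query, response)  # make the result reusable at the edge
    return response, "cloud"
```

In this sketch, a repeated or near-identical request is served from the edge without calling `cloud_llm` again. In the framework described above, the decision to return a cached result or query the cloud is made by the learned policy rather than by a fixed threshold.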

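This study proposes VELO, a vector database-assisted cloud-edge collaborative QoS optimization method for large language models (LLMs). It uses vector database caching to reduce the response time and cost of similar requests, and solves the QoS optimization problem with a multi-agent reinforcement learning algorithm. Experimental results show that the VELO framework significantly improves the satisfaction of edge users of LLMs while reducing delay and resource consumption.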