Large-scale deployment of generative AI tools often depends on costly API
calls to a Large Language Model (LLM) to fulfil user queries. To curtail the
frequency of these calls, one can employ a smaller language model -- a student
-- which is continuously trained on the responses of the LLM. This student
gradually gains proficiency in independently handling an increasing number of
user requests, a process we term neural caching. The crucial element in neural
caching is a policy that decides which requests should be processed by the
student alone and which should be redirected to the LLM, whose responses then
serve as training data for the student. In this study, we focus on
classification tasks, and we
consider a range of classic active learning-based selection criteria as the
policy. Our experiments suggest that Margin Sampling and Query by Committee
bring consistent benefits across tasks and budgets.
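
As a rough illustration of the two criteria named above, the sketch below scores a single request. It is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the threshold of 0.2, and the budget handling are all hypothetical. Margin Sampling looks at the gap between the student's two most probable classes, and Query by Committee is shown here with vote entropy as one common disagreement measure.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two highest class probabilities (needs >= 2 classes)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy of the labels predicted by a committee of students.

    Zero when all members agree; larger when they disagree.
    """
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def should_call_llm(
    probs: Sequence[float],
    budget_left: int,
    margin_threshold: float = 0.2,  # hypothetical value, to be tuned per task
) -> bool:
    """Margin Sampling policy: send the request to the LLM when the student
    is unsure (small margin) and some API budget remains."""
    return budget_left > 0 and margin_score(probs) < margin_threshold


# A request the student is unsure about is redirected to the LLM.
uncertain = should_call_llm([0.45, 0.40, 0.15], budget_left=5)
# A request the student is confident about is answered by the student alone.
confident = should_call_llm([0.90, 0.05, 0.05], budget_left=5)
```

In a full loop, each request sent to the LLM would also be stored with the LLM's label and used to retrain the student. A committee-based variant would compare `vote_entropy` against a threshold in place of the margin check.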
