The effectiveness of Large Language Models (LLMs) in solving tasks depends heavily on the quality of their instructions, which often must be refined through extensive human effort. This highlights the need for automated instruction optimization; however, such optimization is particularly challenging for black-box LLMs, whose parameters and gradients are inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING uses an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39-percentage-point improvement over human benchmarks.
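
To make the continuum bandit framing concrete, the following is a minimal sketch of an actor-critic-style update in a stateless, continuous-action setting. It is not the ACING architecture: the actor is assumed to be a Gaussian policy with a learnable mean, the critic is reduced to a scalar running estimate of the expected reward, and `black_box_reward` is a hypothetical synthetic function standing in for scoring a prompt with a black-box LLM. All names and hyperparameters are illustrative assumptions.

```python
import random


def black_box_reward(action):
    """Hypothetical stand-in for scoring a prompt with a black-box LLM.

    Returns a noisy score that peaks at action = (0.7, -0.3).
    Only the scalar score is observable; no gradient is exposed.
    """
    target = (0.7, -0.3)
    dist2 = sum((a - t) ** 2 for a, t in zip(action, target))
    return 1.0 - dist2 + random.gauss(0.0, 0.05)


def optimize(steps=3000, sigma=0.2, actor_lr=0.01, critic_lr=0.1, seed=0):
    """Stateless actor-critic-style loop (assumed simplification)."""
    random.seed(seed)
    mean = [0.0, 0.0]  # actor: mean of a Gaussian policy over actions
    baseline = 0.0     # critic: running estimate of the expected reward
    for _ in range(steps):
        # Sample a continuous action; there is no state to condition on.
        action = [random.gauss(m, sigma) for m in mean]
        reward = black_box_reward(action)
        advantage = reward - baseline
        # Score-function gradient of log N(action; mean, sigma^2) w.r.t. mean,
        # weighted by the critic's advantage estimate.
        mean = [
            m + actor_lr * advantage * (a - m) / sigma ** 2
            for m, a in zip(mean, action)
        ]
        # Move the critic toward the observed reward.
        baseline += critic_lr * advantage
    return mean, baseline


if __name__ == "__main__":
    learned_mean, learned_baseline = optimize()
    print(learned_mean, learned_baseline)
```

The loop only ever queries the reward function for a scalar score, which is what lets this style of method work when the reward is non-differentiable. A fuller treatment would likely replace the scalar critic with a learned function of the action and adapt the exploration noise, but those choices are beyond what the abstract specifies.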

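This work addresses the challenge of optimizing instructions for large language models (LLMs), particularly in the black-box setting, where model parameters and gradients are unavailable. It proposes ACING, an actor-critic-based reinforcement learning method that learns from non-differentiable reward signals to optimize prompts. Experiments show that ACING outperforms baseline methods across 30 instruction-based tasks and improves on human-crafted instructions by up to 39 percentage points, indicating potential for broad impact.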

ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models