Large Language Models (LLMs) have shown promise as intelligent agents in
interactive decision-making tasks. Traditional approaches often depend on
meticulously designed prompts, high-quality examples, or additional reward
models for in-context learning, supervised fine-tuning, or RLHF, respectively. Reinforcement
learning (RL) presents a dynamic alternative for LLMs to overcome these
dependencies by engaging directly with task-specific environments. However,
applying RL to LLMs faces two significant hurdles: 1) instability stemming from
the exponentially large action space that must be explored; and 2) the
difficulty of assigning token-level credit from action-level reward signals,
which creates a conflict between maximizing rewards and accurately modeling
corpus data. In response to these
challenges, we introduce Entropy-Regularized Token-level Policy Optimization
(ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the
token level. At the heart of ETPO is our novel per-token soft Bellman update,
designed to harmonize the RL process with the principles of language modeling.
This methodology decomposes the Q-function update from a coarse action-level
view to a more granular token-level perspective, backed by theoretical proof of
optimization consistency. Crucially, this decomposition makes the time
complexity of action exploration linear. We assess the effectiveness of ETPO within a
simulated environment that models data science code generation as a series of
multi-step interactive tasks; results show that ETPO improves the performance
of the CodeLlama-7B model and surpasses a PPO-variant baseline inherited from
RLHF. These results underline ETPO's potential as a robust method for refining
the interactive decision-making capabilities of LLMs.
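
To make the contrast between action-level and token-level exploration concrete, the following minimal sketch counts how many candidates each view has to consider. It is only an illustration of the general counting argument, not the paper's algorithm: the function names are hypothetical, and the vocabulary size and action lengths are assumed values chosen for the example.

```python
def action_level_candidates(vocab_size: int, action_length: int) -> int:
    """Count the distinct whole actions (token sequences) of a fixed length."""
    return vocab_size ** action_length


def token_level_candidates(vocab_size: int, action_length: int) -> int:
    """Count the (position, token) choices when tokens are scored one at a time."""
    return vocab_size * action_length


if __name__ == "__main__":
    vocab_size = 32_000  # assumed vocabulary size, for illustration only
    for action_length in (1, 2, 4, 8):  # assumed action lengths in tokens
        whole = action_level_candidates(vocab_size, action_length)
        per_token = token_level_candidates(vocab_size, action_length)
        print(f"length={action_length}: action-level={whole:.3e}, "
              f"token-level={per_token}")
```

Under these assumptions, the action-level count grows exponentially with the action length, while the token-level count grows only linearly, which is the gap a token-level decomposition is meant to exploit.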

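Summary: This work on large language models introduces ETPO, an entropy-regularized reinforcement learning method based on token-level policy optimization and aimed at optimizing language models at the token level. Results show that ETPO achieves solid performance improvements on data science code generation tasks and has the potential to improve interactive decision-making capabilities.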