Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.

离线强化学习中的CGDT方法结合了基于值函数的方法和决策Transformer的轨迹建模能力，通过整合学习的值函数，保证了指定目标回报和动作预期回报之间的直接对齐，从而弥合了RCSL的确定性和基于值函数方法的概率特性之间的差距。在随机环境和D4RL基准数据集上进行的实证评估表明，CGDT方法优于传统的RCSL方法，展示了CGDT在离线强化学习领域中提升技术水平并扩展RCSL在广泛强化学习任务中的适用性的潜力。

离线强化学习的评论引导决策转换器