Offline reinforcement learning (RL) has seen notable advancements through
return-conditioned supervised learning (RCSL) and value-based methods, yet each
approach comes with its own set of practical challenges. Addressing these, we
propose Value-Aided Conditional Supervised Learning (VCS), a method that
effectively synergizes the stability of RCSL with the stitching ability of
value-based methods. Guided by a Neural Tangent Kernel analysis that identifies
when the value function may not lead to stable stitching, VCS dynamically
injects value aid into the RCSL loss function according to the trajectory
return. Our empirical studies reveal that VCS not only significantly
outperforms both RCSL and value-based methods but also consistently achieves,
or often surpasses, the highest trajectory returns across diverse offline RL
benchmarks. These results open new paths in offline RL and encourage further
innovation.
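
As a rough illustration of how a value term might be blended into an RCSL loss according to the trajectory return, the sketch below adds a weighted Q-value term to a squared-error action loss. The linear weight `scale * (max_return - traj_return)`, the function names, and all parameters are assumptions made for illustration; the abstract above does not specify the actual weighting rule.

```python
def value_aid_weight(traj_return, max_return, scale=1.0):
    """Hypothetical weight: zero for a best-return trajectory, larger for worse ones."""
    return scale * max(max_return - traj_return, 0.0)


def vcs_style_loss(pred_action, data_action, q_value,
                   traj_return, max_return, scale=1.0):
    """Per-sample loss: supervised RCSL term minus a return-dependent value term.

    pred_action / data_action: sequences of floats (policy output, dataset action).
    q_value: critic estimate Q(s, pred_action), assumed to be computed elsewhere.
    """
    # RCSL term: squared error between predicted and dataset action.
    rcsl_term = sum((p - a) ** 2 for p, a in zip(pred_action, data_action))
    # Value aid: minimizing the loss pushes the action toward a higher Q-value,
    # more strongly when the trajectory return is far below max_return (assumption).
    weight = value_aid_weight(traj_return, max_return, scale)
    return rcsl_term - weight * q_value


# A best-return trajectory keeps the pure supervised loss.
loss_best = vcs_style_loss([0.5, 0.0], [0.0, 0.0], q_value=2.0,
                           traj_return=100.0, max_return=100.0)
# A lower-return trajectory receives value aid.
loss_low = vcs_style_loss([0.5, 0.0], [0.0, 0.0], q_value=2.0,
                          traj_return=90.0, max_return=100.0, scale=0.1)
```

Under this assumed weighting, samples from high-return trajectories are fit almost purely by imitation, while samples from low-return trajectories rely more on the critic.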

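By incorporating a value function guided by Neural Tangent Kernel analysis, Value-Aided Conditional Supervised Learning (VCS) effectively addresses the practical challenges faced by return-conditioned supervised learning (RCSL) and value-based methods. Empirical studies show that VCS not only clearly outperforms both RCSL and value-based methods but also consistently matches or exceeds the highest trajectory returns across a variety of offline RL benchmarks, opening new paths for offline RL and encouraging further innovation.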

Value-Aided Conditional Supervised Learning for Offline RL

Recent advancements in offline reinforcement learning (RL) have underscored
the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm
that learns the action distribution based on target returns for each state in a
supervised manner. However, prevailing RCSL methods largely focus on
deterministic trajectory modeling, disregarding stochastic state transitions
and the diversity of future trajectory distributions. A fundamental challenge
arises from the inconsistency between the sampled returns within individual
trajectories and the expected returns across multiple trajectories.
Fortunately, value-based methods offer a solution by leveraging a value
function to approximate the expected returns, thereby addressing the
inconsistency effectively. Building upon these insights, we propose a novel
approach, termed the Critic-Guided Decision Transformer (CGDT), which combines
the predictability of long-term returns from value-based methods with the
trajectory modeling capability of the Decision Transformer. By incorporating a
learned value function, known as the critic, CGDT ensures a direct alignment
between the specified target returns and the expected returns of actions. This
integration bridges the gap between the deterministic nature of RCSL and the
probabilistic characteristics of value-based methods. Empirical evaluations on
stochastic environments and D4RL benchmark datasets demonstrate the superiority
of CGDT over traditional RCSL methods. These results highlight the potential of
CGDT to advance the state of the art in offline RL and extend the applicability
of RCSL to a wide range of RL tasks.
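
The gap between sampled and expected returns mentioned in the abstract above can be seen in a toy calculation. The sketch below computes returns-to-go, the quantity RCSL methods typically condition on, for two hypothetical trajectories that take the same action from the same state but receive different rewards because of stochastic transitions. The reward values and function name are illustrative assumptions, not taken from the paper.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: sum of (discounted) rewards from that step onward."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.append(running)
    rtg.reverse()
    return rtg


# Two hypothetical trajectories: same start state and same first action,
# different outcomes caused only by environment randomness.
lucky_rewards = [0.0, 10.0]
unlucky_rewards = [0.0, 0.0]

# Sampled returns attached to the shared first (state, action) pair.
sampled = [returns_to_go(lucky_rewards)[0], returns_to_go(unlucky_rewards)[0]]

# Monte Carlo estimate of the expected return of that action,
# which is what a learned value function (critic) approximates.
expected = sum(sampled) / len(sampled)
```

Here the same action is labeled with a return of 10 in one trajectory and 0 in the other, while its expected return is 5. Conditioning on the sampled value 10 would ask the policy for an outcome the action cannot reliably deliver, which is the inconsistency a critic is meant to correct.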

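For offline reinforcement learning, CGDT combines value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, it ensures a direct alignment between the specified target returns and the expected returns of actions, bridging the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets show that CGDT outperforms traditional RCSL methods, indicating its potential to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.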

Critic-Guided Decision Transformer for Offline Reinforcement Learning

Off-policy dynamic programming (DP) techniques such as $Q$-learning have
proven important for solving sequential decision-making problems. However, in
the presence of function approximation, such algorithms
are not guaranteed to converge, often diverging due to the absence of
Bellman-completeness in the function classes considered, a crucial condition
for the success of DP-based methods. In this paper, we show how off-policy
learning techniques based on return-conditioned supervised learning (RCSL) are
able to circumvent these challenges of Bellman completeness, converging under
significantly more relaxed assumptions inherited from supervised learning. We
prove that there exists a natural environment in which, with a two-layer
multilayer perceptron as the function approximator, the layer width must grow
linearly with the state space size to satisfy Bellman-completeness, while a
constant layer width suffices for RCSL. These findings take a step towards
explaining the superior empirical performance of RCSL methods compared to
DP-based methods in environments with near-optimal datasets. Furthermore, in
order to learn from sub-optimal datasets, we propose a simple framework called
MBRCSL, granting RCSL methods the ability of dynamic programming to stitch
together segments from distinct trajectories. MBRCSL leverages learned dynamics
models and forward sampling to accomplish trajectory stitching while avoiding
the need for Bellman completeness that plagues all dynamic programming
algorithms. We provide both theoretical analysis and experimental evaluation to
back these claims, and MBRCSL outperforms state-of-the-art model-free and
model-based offline RL algorithms across several simulated robotics problems.
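
To make the idea of stitching through a learned model and forward sampling concrete, here is a minimal tabular sketch. It is not the MBRCSL implementation: the count-based "model", the tiny dataset, and every name below are hypothetical stand-ins for the learned dynamics model and sampling procedure the abstract refers to.

```python
import random
from collections import defaultdict

# Hypothetical offline dataset: each step is (state, action, reward, next_state).
# No single trajectory goes from "s0" to the rewarding action "good".
dataset = [
    [("s0", "right", 0.0, "s1"), ("s1", "bad", 0.0, "end")],
    [("sX", "right", 0.0, "s1"), ("s1", "good", 1.0, "end")],
]


def fit_tabular_models(trajectories):
    """Collect empirical behavior actions and observed (reward, next_state) outcomes."""
    behavior = defaultdict(list)   # state -> actions seen in the data
    dynamics = defaultdict(list)   # (state, action) -> observed outcomes
    for traj in trajectories:
        for s, a, r, s_next in traj:
            behavior[s].append(a)
            dynamics[(s, a)].append((r, s_next))
    return behavior, dynamics


def forward_sample(start, behavior, dynamics, rng, max_steps=10):
    """Roll out the empirical behavior policy inside the fitted model."""
    traj, state = [], start
    for _ in range(max_steps):
        if state not in behavior:  # terminal or unseen state
            break
        action = rng.choice(behavior[state])
        reward, next_state = rng.choice(dynamics[(state, action)])
        traj.append((state, action, reward, next_state))
        state = next_state
    return traj


def total_return(traj):
    return sum(step[2] for step in traj)


rng = random.Random(0)
behavior, dynamics = fit_tabular_models(dataset)
rollouts = [forward_sample("s0", behavior, dynamics, rng) for _ in range(200)]

# The best rollout stitches the first half of trajectory 1 onto the second half
# of trajectory 2; such rollouts could then serve as RCSL training data.
best = max(rollouts, key=total_return)
```

In this toy setting the best dataset trajectory starting at `s0` has return 0, while the best sampled rollout reaches return 1 by combining segments of both trajectories, without any Bellman backup.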

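This paper shows that off-policy learning techniques based on return-conditioned supervised learning (RCSL) converge under assumptions more relaxed than Bellman completeness: with a two-layer multilayer perceptron as the function approximator, a constant layer width suffices for RCSL, whereas satisfying Bellman-completeness requires the width to grow linearly with the state space size. It also proposes the MBRCSL framework, which uses learned dynamics models and forward sampling to stitch trajectories while avoiding the Bellman completeness requirement that affects all dynamic programming algorithms.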