Off-policy dynamic programming (DP) techniques such as $Q$-learning have
proven important for solving sequential decision-making
problems. However, in the presence of function approximation, such algorithms
are not guaranteed to converge, often diverging due to the absence of
Bellman-completeness in the function classes considered, a crucial condition
for the success of DP-based methods. In this paper, we show how off-policy
learning techniques based on return-conditioned supervised learning (RCSL) are
able to circumvent these challenges of Bellman completeness, converging under
significantly more relaxed assumptions inherited from supervised learning. We
prove there exists a natural environment in which, if one uses a two-layer
multilayer perceptron as the function approximator, the layer width needs to
grow linearly with the state space size to satisfy Bellman-completeness while a
constant layer width is enough for RCSL. These findings take a step towards
explaining the superior empirical performance of RCSL methods compared to
DP-based methods in environments with near-optimal datasets. Furthermore, in
order to learn from sub-optimal datasets, we propose a simple framework called
MBRCSL, granting RCSL methods the ability of dynamic programming to stitch
together segments from distinct trajectories. MBRCSL leverages learned dynamics
models and forward sampling to accomplish trajectory stitching while avoiding
the need for Bellman completeness that plagues all dynamic programming
algorithms. We provide both theoretical analysis and experimental evaluation to
back these claims, and MBRCSL outperforms state-of-the-art model-free and model-based
offline RL algorithms across several simulated robotics problems.

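In this paper, we show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) converge under conditions that are relaxed relative to Bellman completeness, achieving performance comparable to dynamic programming methods when a two-layer multilayer perceptron is used as the function approximator. We also propose the MBRCSL framework, which uses learned dynamics models and forward sampling to stitch trajectories together, thereby avoiding the need for Bellman completeness that plagues all dynamic programming algorithms.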
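
To make the RCSL idea concrete, the following is a minimal tabular sketch, not the method evaluated in the paper. It assumes a hypothetical toy dataset of `(state, action, reward)` steps, undiscounted returns, and a policy fit by simple counting rather than by a neural network. The helper names `returns_to_go`, `fit_rcsl`, and `act` are illustrative only.

```python
from collections import Counter, defaultdict


def returns_to_go(rewards):
    """Suffix sums g_t = r_t + r_{t+1} + ... (undiscounted, an assumption)."""
    out, total = [], 0.0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


def fit_rcsl(trajectories):
    """Supervised learning by counting: actions seen for each (state, return-to-go)."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        rtg = returns_to_go([r for _, _, r in traj])
        for (s, a, _), g in zip(traj, rtg):
            counts[(s, g)][a] += 1
    return counts


def act(counts, state, target_return):
    """Most frequent action for (state, target return), or None if the pair is unseen."""
    c = counts.get((state, target_return))
    return c.most_common(1)[0][0] if c else None


# Hypothetical toy data: each step is (state, action, reward).
trajectories = [
    [("s0", "left", 0.0), ("s1", "left", 0.0)],
    [("s0", "right", 0.0), ("s2", "right", 1.0)],
]
policy = fit_rcsl(trajectories)
best_action = act(policy, "s0", 1.0)  # conditions on the highest observed return
```

In this sketch, conditioning on a return that never co-occurs with a state in the data (for example, state `s1` with target return `1.0`) yields no action. This is one way to see why plain RCSL cannot stitch segments from distinct trajectories, the limitation that MBRCSL is proposed to address.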