This paper introduces DiffTOP, which utilizes Differentiable Trajectory
OPtimization as the policy representation to generate actions for deep
reinforcement and imitation learning. Trajectory optimization is a powerful and
widely used algorithm in control, parameterized by a cost and a dynamics
function. The key to our approach is to leverage the recent progress in
differentiable trajectory optimization, which enables computing the gradients
of the loss with respect to the parameters of trajectory optimization. As a
result, the cost and dynamics functions of trajectory optimization can be
learned end-to-end. DiffTOP addresses the "objective mismatch" issue of prior
model-based RL algorithms, as the dynamics model in DiffTOP is learned to
directly maximize task performance by differentiating the policy gradient loss
through the trajectory optimization process. We further benchmark DiffTOP for
imitation learning on standard robotic manipulation task suites with
high-dimensional sensory observations and compare our method to feed-forward
policy classes as well as Energy-Based Models (EBM) and Diffusion. Across 15
model-based RL tasks and 13 imitation learning tasks with high-dimensional
image and point cloud inputs, DiffTOP outperforms prior state-of-the-art
methods in both domains.
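
To make the idea of differentiating a loss through trajectory optimization concrete, here is a minimal sketch on a toy problem. It is not DiffTOP's implementation: the scalar state, the known dynamics `x + a`, the one-step horizon, the learnable goal `theta`, and every function name are assumptions made for illustration. The inner optimizer picks an action by minimizing a cost, and the imitation loss is differentiated with respect to `theta` via the implicit function theorem at the optimum.

```python
# Toy sketch: differentiate an imitation loss through a one-step trajectory
# optimizer. The problem and all names are illustrative assumptions.

LAM = 0.1  # effort penalty weight (assumed)

def solve_action(x, theta, steps=200, lr=0.1):
    """Inner problem: minimize (x + a - theta)^2 + LAM * a^2 over the action a."""
    a = 0.0
    for _ in range(steps):
        grad_a = 2.0 * (x + a - theta) + 2.0 * LAM * a
        a -= lr * grad_a
    return a

def daction_dtheta():
    """Implicit function theorem at the optimum: da*/dtheta = -c_atheta / c_aa."""
    c_aa = 2.0 + 2.0 * LAM   # second derivative of the cost in a
    c_atheta = -2.0          # mixed derivative of the cost in a and theta
    return -c_atheta / c_aa

def fit_theta(demos, theta=0.0, lr=0.5, epochs=200):
    """Outer problem: fit theta so the optimizer's action matches expert actions."""
    for _ in range(epochs):
        grad = 0.0
        for x, a_expert in demos:
            a_star = solve_action(x, theta)
            # Chain rule: d(loss)/d(theta) = d(loss)/d(a*) * d(a*)/d(theta)
            grad += 2.0 * (a_star - a_expert) * daction_dtheta()
        theta -= lr * grad / len(demos)
    return theta

# Expert demonstrations generated with a hidden goal of 2.0.
demos = [(x, (2.0 - x) / (1.0 + LAM)) for x in (-1.0, 0.0, 0.5, 3.0)]
theta_hat = fit_theta(demos)
```

In this sketch the cost parameter is trained only through the quality of the action the optimizer returns, which is the sense in which the parameters of trajectory optimization can be learned end-to-end. A realistic version would use multi-step horizons, learned dynamics, and automatic differentiation rather than a hand-derived gradient.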

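DiffTOP uses differentiable trajectory optimization as its policy representation. By learning the parameters of the trajectory optimization end-to-end, it addresses the objective mismatch problem and outperforms prior state-of-the-art methods on deep reinforcement learning and imitation learning tasks.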

DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning

Reinforcement Learning (RL) is notoriously data-inefficient, which makes
training on a real robot difficult. While model-based RL algorithms (world
models) improve data-efficiency to some extent, they still require hours or
days of interaction to learn skills. Recently, offline RL has been proposed as
a framework for training RL policies on pre-existing datasets without any
online interaction. However, constraining an algorithm to a fixed dataset
induces a state-action distribution shift between training and inference, and
limits its applicability to new tasks. In this work, we seek to get the best of
both worlds: we consider the problem of pretraining a world model with offline
data collected on a real robot, and then finetuning the model on online data
collected by planning with the learned model. To mitigate extrapolation errors
during online interaction, we propose to regularize the planner at test-time by
balancing estimated returns and (epistemic) model uncertainty. We evaluate our
method on a variety of visuo-motor control tasks in simulation and on a real
robot, and find that our method enables few-shot finetuning to seen and unseen
tasks even when offline data is limited. Videos, code, and data are available
at this https URL.
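
The test-time regularization described above, balancing estimated returns against epistemic uncertainty, can be sketched as follows. This is a hedged illustration, not the paper's planner: it assumes uncertainty is measured as the disagreement (standard deviation) of an ensemble of return estimators, and the names `regularized_score`, `select_plan`, and `penalty` are hypothetical.

```python
from statistics import mean, pstdev

def regularized_score(returns, penalty):
    """Mean estimated return minus a penalty on ensemble disagreement."""
    return mean(returns) - penalty * pstdev(returns)

def select_plan(candidates, ensemble, penalty=1.0):
    """Pick the candidate plan with the best uncertainty-penalized score."""
    best_plan, best_score = None, float("-inf")
    for plan in candidates:
        returns = [estimate(plan) for estimate in ensemble]
        score = regularized_score(returns, penalty)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan

# Toy ensemble (assumed): each member maps a plan name to an estimated return.
tables = [
    {"safe": 1.0, "risky": 3.0},
    {"safe": 1.1, "risky": 0.0},
    {"safe": 0.9, "risky": 1.5},
]
ensemble = [lambda plan, t=t: t[plan] for t in tables]
candidates = ["safe", "risky"]

cautious_choice = select_plan(candidates, ensemble, penalty=1.0)
greedy_choice = select_plan(candidates, ensemble, penalty=0.0)
```

With the penalty switched off, the planner prefers the plan with the highest average estimate even though the ensemble members disagree strongly about it. With the penalty on, it prefers the plan the ensemble agrees on, which is the intuition behind avoiding regions where the model extrapolates.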

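This paper pretrains a world model on offline data collected on a real robot and then finetunes it on online data collected by planning with the learned model. The goal is to address both the data inefficiency of reinforcement learning on real robots and the distribution shift between training and inference. The method is evaluated on visuo-motor control tasks in simulation and on a real robot, and it enables few-shot finetuning to seen and unseen tasks even when offline data is limited.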

Finetuning Offline World Models in the Real World

Learning policies on data synthesized by models can in principle quench the
thirst of reinforcement learning algorithms for large amounts of real
experience, which is often costly to acquire. However, simulating plausible
experience de novo is a hard problem for many complex environments, often
resulting in biases for model-based policy evaluation and search. Instead of de
novo synthesis of data, here we assume logged, real experience and model
alternative outcomes of this experience under counterfactual actions, actions
that were not actually taken. Based on this, we propose the
Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies
in POMDPs from off-policy experience. It leverages structural causal models for
counterfactual evaluation of arbitrary policies on individual off-policy
episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use
of available logged data to de-bias model predictions. In contrast to
off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS
leverages a model to explicitly consider alternative outcomes, allowing the
algorithm to make better use of experience data. We find empirically that these
advantages translate into improved policy evaluation and search results on a
non-trivial grid-world task. Finally, we show that CF-GPS generalizes the
previously proposed Guided Policy Search and that reparameterization-based
algorithms such as Stochastic Value Gradient can be interpreted as counterfactual
methods.
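
Counterfactual evaluation of a policy on a single logged episode can be sketched in three steps: infer the noise that explains the logged episode, swap in the new policy, and replay the episode with the same noise. The example below is a simplified illustration, not the CF-GPS algorithm: it assumes a fully observed scalar state and additive noise in the dynamics `s' = s + a + u`, so the noise can be recovered exactly, whereas the POMDP setting in the abstract generally requires inferring a posterior over the noise. All function names are hypothetical.

```python
def infer_noise(states, actions):
    """Abduction: recover the noise u_t = s_{t+1} - s_t - a_t from a logged episode."""
    return [states[t + 1] - states[t] - actions[t] for t in range(len(actions))]

def counterfactual_rollout(s0, noise, policy):
    """Action and prediction: replay the episode under a new policy, same noise."""
    states = [s0]
    for u in noise:
        a = policy(states[-1])
        states.append(states[-1] + a + u)
    return states

def episode_return(states):
    """Toy objective (assumed): keep the state close to zero."""
    return -sum(s * s for s in states[1:])

# Logged episode under a behavior policy that always chose action 0.
logged_states = [1.0, 1.5, 1.25, 1.5]
logged_actions = [0.0, 0.0, 0.0]

noise = infer_noise(logged_states, logged_actions)
cf_states = counterfactual_rollout(logged_states[0], noise, lambda s: -s)
logged_return = episode_return(logged_states)
cf_return = episode_return(cf_states)
```

Because the replay reuses the noise inferred from real experience, the comparison between the logged and counterfactual returns holds the environment's randomness fixed for that episode. This contrasts with importance sampling, which re-weights the logged outcomes rather than predicting alternative ones.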

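CF-GPS uses structural causal models to evaluate policies counterfactually on logged off-policy experience, and uses that logged data to de-bias model predictions.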