Offline reinforcement learning provides a viable approach to obtain advanced
control strategies for dynamical systems, in particular when direct interaction
with the environment is not available. In this paper, we introduce a conceptual
extension to model-based policy search methods, called variable objective
policy (VOP). With this approach, policies are trained to generalize
efficiently over a variety of objectives, which parameterize the reward
function. We demonstrate that by altering the objectives passed as input to the
policy, users gain the freedom to adjust its behavior or re-balance
optimization targets at runtime, without the need to collect additional
observation batches or to retrain.
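
The objective-conditioned idea can be sketched in a few lines. This is a toy illustration under assumptions that do not come from the paper: a hypothetical 1-D linear system, a least-squares transition model fitted to a fixed offline batch, a reward r = -(w*x^2 + (1-w)*u^2) parameterized by a scalar objective w, a policy whose gain depends linearly on w, and grid search standing in for gradient-based policy training. All function names (collect_offline_batch, train_vop, ...) are hypothetical.

```python
import numpy as np

# Hypothetical toy system (not from the paper): x' = a*x + b*u + noise.
A_TRUE, B_TRUE = 1.0, 1.0
HORIZON = 20

def collect_offline_batch(n=500, seed=0):
    """Simulate a fixed batch of transitions recorded with random actions."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, n)
    u = rng.uniform(-1.0, 1.0, n)
    x_next = A_TRUE * x + B_TRUE * u + 0.01 * rng.standard_normal(n)
    return x, u, x_next

def fit_linear_model(x, u, x_next):
    """Least-squares estimate of (a, b); stands in for a learned transition model."""
    features = np.stack([x, u], axis=1)
    (a_hat, b_hat), *_ = np.linalg.lstsq(features, x_next, rcond=None)
    return a_hat, b_hat

def policy(theta, x, w):
    """Objective-conditioned policy: w is an extra input, entering via the gain k(w)."""
    gain = theta[0] + theta[1] * w
    return -gain * x

def model_return(theta, model, w, x0=1.0):
    """Return on a rollout of the learned model, with the reward parameterized by w."""
    a_hat, b_hat = model
    x = np.full_like(w, x0)
    total = np.zeros_like(w)
    for _ in range(HORIZON):
        u = policy(theta, x, w)
        # w trades off state deviation against control effort
        total -= w * x**2 + (1.0 - w) * u**2
        x = a_hat * x + b_hat * u
    return total

def train_vop(model, objectives):
    """Grid search over theta, maximizing the return averaged over sampled objectives."""
    best_theta, best_value = None, -np.inf
    for t0 in np.linspace(-0.5, 1.0, 31):
        for t1 in np.linspace(-1.0, 2.0, 61):
            theta = (t0, t1)
            value = model_return(theta, model, objectives).mean()
            if value > best_value:
                best_theta, best_value = theta, value
    return best_theta

model = fit_linear_model(*collect_offline_batch())
theta = train_vop(model, objectives=np.linspace(0.05, 0.95, 10))

# At runtime only the objective input changes; theta and the model stay fixed.
for w in (0.1, 0.5, 0.9):
    print(f"w={w}: gain={theta[0] + theta[1] * w:.2f}")
```

Because w is a policy input, changing it at runtime moves the controller's gain without refitting the model or repeating the search. Grid search is used here only because it is short and deterministic; a practical setup would more likely train a neural policy with a gradient-based optimizer.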

Learning Control Policies for Variable Objectives from Offline Data

Autonomous learning has been a promising direction in control and robotics
for more than a decade, since data-driven learning reduces the amount of
engineering knowledge that would otherwise be required. However, autonomous
reinforcement learning (RL) approaches typically require many interactions with
the system to learn controllers, which is a practical limitation in real
systems, such as robots, where many interactions can be impractical and time
consuming. To address this problem, current learning approaches typically
require task-specific knowledge in form of expert demonstrations, realistic
simulators, pre-shaped policies, or specific knowledge about the underlying
dynamics. In this article, we follow a different approach and speed up learning
by extracting more information from data. In particular, we learn a
probabilistic, non-parametric Gaussian process transition model of the system.
By explicitly incorporating model uncertainty into long-term planning and
controller learning, our approach reduces the effects of model errors, a key
problem in model-based learning. Compared to state-of-the-art RL, our
model-based policy search method achieves an unprecedented learning speed.
We demonstrate its applicability to autonomous learning in real robot and
control tasks.
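
The two ingredients named above, a Gaussian process transition model and planning that accounts for its uncertainty, can be sketched as follows. Assumptions not taken from the abstract: a hypothetical 1-D system dx = 0.5*u + 0.1*sin(x), a squared-exponential kernel with hand-fixed hyperparameters (no marginal-likelihood optimization), and a linear policy u = -k*x. Uncertainty is propagated here by Monte Carlo sampling from the per-step predictive distribution, a simple stand-in chosen for brevity; it is not necessarily how the described method propagates uncertainty.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel (unit signal variance) between rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

class GPTransitionModel:
    """GP regression of the state change dx = f(x, u) + noise; hyperparameters are fixed."""

    def __init__(self, inputs, targets, noise_var=1e-4, lengthscale=1.0):
        self.X, self.lengthscale, self.noise_var = inputs, lengthscale, noise_var
        K = rbf_kernel(inputs, inputs, lengthscale) + noise_var * np.eye(len(inputs))
        self.L = np.linalg.cholesky(K)
        # alpha = K^{-1} y via two triangular solves
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, targets))

    def predict(self, query):
        """Predictive mean and variance of dx at each row (x, u) of `query`."""
        K_star = rbf_kernel(query, self.X, self.lengthscale)
        mean = K_star @ self.alpha
        v = np.linalg.solve(self.L, K_star.T)
        latent_var = np.maximum(1.0 - (v**2).sum(axis=0), 0.0)
        return mean, latent_var + self.noise_var

def expected_cost(gp, gain, x0=1.0, horizon=15, n_samples=200, seed=0):
    """Monte Carlo estimate of sum_t E[x_t^2] for the policy u = -gain * x under the GP."""
    rng = np.random.default_rng(seed)
    x = np.full(n_samples, x0)
    total = 0.0
    for _ in range(horizon):
        u = -gain * x
        mean, var = gp.predict(np.column_stack([x, u]))
        # Sample the next state from the per-step predictive distribution.
        x = x + mean + np.sqrt(var) * rng.standard_normal(n_samples)
        total += np.mean(x**2)
    return total

# Hypothetical system, used only to generate an illustrative data set.
rng = np.random.default_rng(1)
X_train = np.column_stack([rng.uniform(-2, 2, 100), rng.uniform(-1, 1, 100)])
dx_train = 0.5 * X_train[:, 1] + 0.1 * np.sin(X_train[:, 0]) + 0.01 * rng.standard_normal(100)
gp = GPTransitionModel(X_train, dx_train)

_, var_near = gp.predict(np.array([[0.0, 0.0]]))
_, var_far = gp.predict(np.array([[10.0, 0.0]]))
print("predictive variance near data:", var_near[0], "far from data:", var_far[0])
print("expected cost, gain 0.0:", expected_cost(gp, 0.0))
print("expected cost, gain 1.0:", expected_cost(gp, 1.0))
```

Far from the training inputs the predictive variance returns to its prior level, so rollouts that leave the data spread out and raise the expected cost instead of being trusted as confident predictions. This is roughly the mechanism by which accounting for model uncertainty limits the impact of model errors.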

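This paper presents a model-based policy search approach to autonomous
learning that uses a probabilistic, non-parametric Gaussian process transition
model to extract more information from data, increasing learning speed and
reducing the effects of model errors; it has been applied to real robot and
control tasks.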