Offline reinforcement learning provides a viable approach to obtaining advanced
control strategies for dynamical systems, in particular when direct interaction
with the environment is not available. In this paper, we introduce a conceptual
extension for model-based policy search methods, called variable objective
policy (VOP). With this approach, policies are trained to generalize
efficiently over a variety of objectives, which parameterize the reward
function. We demonstrate that by altering the objectives passed as input to the
policy, users gain the freedom to adjust its behavior or re-balance
optimization targets at runtime, without the need to collect additional
observation batches or to re-train.
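
As a rough illustration of the idea, the sketch below shows a policy that receives the objective as an extra input alongside the state, with the same objective parameterizing the reward. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional toy model, the two reward weights (`w_track`, `w_effort`), and the closed-form policy, which stands in for a policy trained by model-based policy search.

```python
from typing import Tuple

# Hypothetical objective: (w_track, w_effort) weights of the reward terms.
Objective = Tuple[float, float]


def reward(x_next: float, action: float, objective: Objective,
           target: float = 0.0) -> float:
    """Reward parameterized by the objective: weighted tracking error and effort."""
    w_track, w_effort = objective
    return -(w_track * (x_next - target) ** 2 + w_effort * action ** 2)


def model(x: float, action: float) -> float:
    """Toy transition model (assumed): the action shifts the state directly."""
    return x + action


def policy(x: float, objective: Objective, target: float = 0.0) -> float:
    """Objective-conditioned policy: the objective is an input next to the state.

    Stand-in for a trained policy. This closed form maximizes the one-step
    reward above under the toy model; it assumes w_track + w_effort > 0.
    """
    w_track, w_effort = objective
    return w_track * (target - x) / (w_track + w_effort)


def rollout(x0: float, objective: Objective, steps: int = 5) -> Tuple[float, float]:
    """Roll the policy out on the model; return the final state and total reward."""
    x, total = x0, 0.0
    for _ in range(steps):
        a = policy(x, objective)
        x = model(x, a)
        total += reward(x, a, objective)
    return x, total


if __name__ == "__main__":
    # Same policy, different objectives passed at runtime, no re-training.
    for objective in [(1.0, 0.0), (1.0, 1.0), (1.0, 9.0)]:
        final_state, total_reward = rollout(4.0, objective)
        print(objective, final_state, total_reward)
```

Changing only the objective tuple changes how aggressively the policy drives the state toward the target: a larger `w_effort` yields smaller actions and slower convergence. In this toy setting that is the runtime re-balancing described above.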
