Model-based reinforcement learning (MBRL) holds the promise of
sample-efficient learning by utilizing a world model, which models how the
environment works and typically encompasses components for two tasks:
observation modeling and reward modeling. In this paper, through a dedicated
empirical investigation, we gain a deeper understanding of the role each task
plays in world models and uncover the overlooked potential for more efficient
MBRL by mitigating the interference between observation and reward modeling.
Our key insight is that while prevalent explicit MBRL approaches attempt to
restore abundant details of the environment through observation models, doing so
is difficult given the environment's complexity and limited model capacity. On
the other hand, reward models, while dominating in implicit MBRL and adept at
learning task-centric dynamics, are inadequate for sample-efficient learning
without richer learning signals. Capitalizing on these insights and
discoveries, we propose a simple yet effective method, Harmony World Models
(HarmonyWM), that introduces a lightweight harmonizer to maintain a dynamic
equilibrium between the two tasks in world model learning. Our experiments on
three visual control domains show that the base MBRL method equipped with
HarmonyWM achieves absolute performance gains of 10%-55%.

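Through an empirical investigation, this paper examines in depth the roles of observation modeling and reward modeling in world models, and finds that mitigating the interference between the two offers potential for more efficient model-based reinforcement learning. Building on these findings, it proposes a simple yet effective method, Harmony World Models (HarmonyWM), which introduces a lightweight harmonizer to maintain a dynamic equilibrium between the two tasks in world model learning. Experiments show that the base model-based reinforcement learning method equipped with HarmonyWM achieves absolute performance gains of 10% to 55% on three visual control domains.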