Standard model-based reinforcement learning (MBRL) approaches fit a
transition model of the environment to all past experience, but this wastes
model capacity on data that is irrelevant for policy improvement. We instead
propose a new "transition occupancy matching" (TOM) objective for MBRL model
learning: a model is good to the extent that the current policy experiences the
same distribution of transitions inside the model as in the real environment.
We derive TOM directly from a novel lower bound on the standard reinforcement
learning objective. To optimize TOM, we show how to reduce it to a form of
importance-weighted maximum-likelihood estimation, where the automatically
computed importance weights identify policy-relevant past experiences from a
replay buffer, enabling stable optimization. TOM thus offers a plug-and-play
model learning sub-routine that is compatible with any backbone MBRL algorithm.
On various MuJoCo continuous robotic control tasks, we show that TOM
successfully focuses model learning on policy-relevant experience and drives
policies faster to higher task rewards than alternative model learning
approaches.

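This paper proposes TOM, a new model learning objective that uses importance-weighted maximum-likelihood estimation to identify policy-relevant past experience in the replay buffer. This focuses model learning on policy-relevant experience, so that the policy completes tasks faster and attains higher rewards.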
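
As a rough illustration of the importance-weighted maximum-likelihood form described above, the following Python sketch computes a weighted negative log-likelihood for a diagonal-Gaussian transition model. The function name `weighted_gaussian_nll`, the Gaussian model class, and the self-normalization of the weights are assumptions made for illustration only. The weights are taken as given inputs here; how TOM actually computes them is not shown.

```python
import math


def weighted_gaussian_nll(pred_means, pred_log_vars, targets, weights):
    """Importance-weighted negative log-likelihood (hypothetical helper).

    Each of pred_means, pred_log_vars, and targets is a list of per-sample
    lists (one entry per next-state dimension). weights holds one
    non-negative importance weight per replay-buffer sample.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")

    loss = 0.0
    for mean, log_var, target, w in zip(pred_means, pred_log_vars, targets, weights):
        # Diagonal-Gaussian NLL of one transition, summed over dimensions.
        nll = sum(
            0.5 * (math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
            for m, lv, t in zip(mean, log_var, target)
        )
        # Assumption: weights are self-normalized so the loss scale does not
        # depend on their overall magnitude.
        loss += (w / total_weight) * nll
    return loss


# Toy usage: three replay samples with two state dimensions each.
means = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
log_vars = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
targets = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

uniform_loss = weighted_gaussian_nll(means, log_vars, targets, [1.0, 1.0, 1.0])
focused_loss = weighted_gaussian_nll(means, log_vars, targets, [1.0, 1.0, 0.0])
```

With uniform weights this reduces to ordinary maximum-likelihood fitting on the whole buffer. Setting a sample's weight to zero removes its influence on the loss entirely, which is the sense in which weighting can focus model capacity on a subset of past experience.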