Current approaches to model-based offline Reinforcement Learning (RL) often
incorporate uncertainty-based reward penalization to address the distributional
shift problem. While these approaches have achieved some success, we argue that
this penalization introduces excessive conservatism, potentially resulting in
suboptimal policies through underestimation. We identify the lack of a reliable
uncertainty estimator capable of propagating uncertainties through the Bellman
operator as an important cause of over-penalization. The common approach to
calculating the penalty term relies on sampling-based uncertainty estimation,
resulting in high variance. To address this challenge, we propose a novel
method termed Moment Matching Offline Model-Based Policy Optimization (MOMBO).
MOMBO learns a Q-function using moment matching, which allows us to
deterministically propagate uncertainties through the Q-function. We evaluate
MOMBO's performance across various environments and demonstrate empirically
that MOMBO is a more stable and sample-efficient approach.
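
To make the contrast between sampling-based and deterministic uncertainty propagation concrete, the sketch below pushes a Gaussian input through one linear unit followed by a ReLU. It uses the standard closed-form moments of a rectified Gaussian and compares them with a Monte Carlo estimate. It is a minimal illustration under stated assumptions, not MOMBO's implementation: the function names, the toy weights, and the assumption of independent Gaussian inputs are all hypothetical and do not come from the paper.

```python
import math
import random


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_moments(mean, var, weights, bias):
    # Exact mean and variance of W x + b, assuming independent inputs.
    out_mean = [sum(w * m for w, m in zip(row, mean)) + b
                for row, b in zip(weights, bias)]
    out_var = [sum(w * w * v for w, v in zip(row, var)) for row in weights]
    return out_mean, out_var


def relu_moments(mean, var):
    # Closed-form mean and variance of max(0, x) for x ~ N(mean, var).
    if var <= 0.0:
        return max(mean, 0.0), 0.0
    std = math.sqrt(var)
    a = mean / std
    m1 = mean * normal_cdf(a) + std * normal_pdf(a)
    m2 = (mean * mean + var) * normal_cdf(a) + mean * std * normal_pdf(a)
    return m1, max(m2 - m1 * m1, 0.0)


def relu_moments_sampled(mean, var, n_samples, rng):
    # Sampling-based baseline: the estimate varies with the random draws.
    std = math.sqrt(var)
    ys = [max(0.0, rng.gauss(mean, std)) for _ in range(n_samples)]
    m1 = sum(ys) / n_samples
    return m1, sum((y - m1) ** 2 for y in ys) / n_samples


# Hypothetical toy layer: two uncertain inputs feeding one hidden unit.
h_mean, h_var = linear_moments([1.0, 2.0], [0.5, 0.25], [[2.0, -1.0]], [0.5])

# Deterministic propagation: the same inputs always give the same moments.
out_mean, out_var = relu_moments(h_mean[0], h_var[0])

# Monte Carlo propagation: the result depends on the seed and sample count.
mc_mean, mc_var = relu_moments_sampled(h_mean[0], h_var[0], 20000,
                                       random.Random(0))

print("moment matching:", out_mean, out_var)
print("monte carlo:    ", mc_mean, mc_var)
```

Calling `relu_moments` twice on the same input returns identical values, whereas `relu_moments_sampled` returns a different estimate for each seed. In a deeper network the post-ReLU distribution is no longer Gaussian, so repeating these two steps layer by layer is an approximation rather than an exact computation.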

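Moment Matching Offline Model-Based Policy Optimization (MOMBO) propagates uncertainties deterministically to address the suboptimal policies that over-penalization causes in model-based offline reinforcement learning. Empirical studies across various environments show that MOMBO is a more stable and more sample-efficient approach.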