Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance. To address this, we characterize them as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formalizes the value of exploration by formulating the RL process as updating a prior over possible MDPs through experience. RL algorithms can be viewed as BAMDP policies; instead of attempting to find optimal algorithms by solving BAMDPs directly, we use it at a theoretical framework for understanding how pseudo-rewards guide suboptimal algorithms. By decomposing BAMDP state value into the value of the information collected plus the prior value of the physical state, we show how psuedo-rewards can help by compensating for RL algorithms' misestimation of these two terms, yielding a new typology of IM and reward shaping approaches. We carefully extend the potential-based shaping theorem to BAMDPs to prove that when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs), they preserve optimal, or approximately optimal, behavior of RL algorithms; otherwise, they can corrupt even optimal learners. We finally give guidance on how to design or convert existing pseudo-rewards to BAMPFs by expressing assumptions about the environment as potential functions on BAMDP states.

本研究解决了内在动机和奖励塑形在强化学习中的设计挑战，提出将其视为贝叶斯自适应马尔可夫决策过程（BAMDP）中的奖励塑形。研究表明，当伪奖励符合BAMDP潜力基础塑形函数时，可以保持强化学习算法的最优或近似最优行为，从而为奖励设计提供了新的指导。

BAMDP塑形：内在动机与奖励塑形的统一理论框架