We can usually assume others have goals analogous to our own. This assumption
can also, at times, be applied to multi-agent games: for example, Agent 1's
attraction to green pellets is analogous to Agent 2's attraction to red pellets. This
"analogy" assumption is tied closely to the cognitive process known as empathy.
Inspired by empathy, we design a simple and explainable architecture to model
another agent's action-value function. This involves learning an "Imagination
Network" to transform the other agent's observed state in order to produce a
human-interpretable "empathetic state" which, when presented to the learning
agent, produces behaviours that mimic the other agent. Our approach is
applicable to multi-agent scenarios consisting of a single learning agent and
other (independent) agents acting according to fixed policies. This
architecture is particularly beneficial for (but not limited to) algorithms
using a composite value or reward function. We show that our method improves
performance in multi-agent games, where it robustly estimates the other agent's
model across different environment configurations. Additionally, we show that
the empathetic states are human-interpretable, and thus verifiable.
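
The idea of reusing the learning agent's own value function on a transformed state can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the four-feature state layout, the linear `Q_WEIGHTS`, and the fixed channel-swap matrix `IMAGINE`, which is a hand-written stand-in for the learned Imagination Network rather than the architecture from the paper.

```python
import numpy as np

# Hypothetical feature layout: [green_left, green_right, red_left, red_right].
# Learning agent's own (linear) action-value function: it is drawn to green pellets.
Q_WEIGHTS = np.array([
    [1.0, 0.0, 0.0, 0.0],  # action 0 = move left
    [0.0, 1.0, 0.0, 0.0],  # action 1 = move right
])

# Hand-written stand-in for a learned Imagination Network: it swaps the green
# and red channels, so red pellets look like green ones.
IMAGINE = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])


def own_q(state):
    """Action values of the learning agent for a given state."""
    return Q_WEIGHTS @ state


def empathetic_state(other_state):
    """Transform the other agent's observed state into an 'empathetic state'."""
    return IMAGINE @ other_state


def estimated_other_q(other_state):
    """Estimate the other agent's action values by feeding the empathetic
    state to the learning agent's own value function."""
    return own_q(empathetic_state(other_state))


# Green pellet on the left, red pellet on the right.
state = np.array([1.0, 0.0, 0.0, 1.0])
print("own greedy action:", int(own_q(state).argmax()))
print("estimated other greedy action:", int(estimated_other_q(state).argmax()))
```

Because the empathetic state lives in the same feature space as the agent's own observations, it can be inspected directly, which is the sense in which such a state is interpretable in this toy setting.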

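By designing an Imagination Network to model another agent's action-value function, the learning agent produces behaviours that resemble those of the other agent. The method is particularly suited to algorithms that use a composite value or reward function, and it yields better performance in multi-agent games.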

EMOTE: An Explainable architecture for Modelling the Other Through Empathy

Applying probabilistic models to reinforcement learning (RL) enables the
application of powerful optimisation tools such as variational inference to RL.
However, existing inference frameworks and their algorithms pose significant
challenges for learning optimal policies, e.g., the absence of mode capturing
behaviour in pseudo-likelihood methods and difficulties learning deterministic
policies in maximum entropy RL based approaches. We propose VIREL, a novel,
theoretically grounded probabilistic inference framework for RL that utilises a
parametrised action-value function to summarise future dynamics of the
underlying MDP. This gives VIREL a mode-seeking form of KL divergence, the
ability to learn deterministic optimal policies naturally from inference and the
ability to optimise value functions and policies in separate, iterative steps.
In applying variational expectation-maximisation to VIREL we thus show that the
actor-critic algorithm can be reduced to expectation-maximisation, with policy
improvement equivalent to an E-step and policy evaluation to an M-step. We then
derive a family of actor-critic methods from VIREL, including a scheme for
adaptive exploration. Finally, we demonstrate that actor-critic algorithms from
this family outperform state-of-the-art methods based on soft value functions
in several domains.
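
The alternation between policy improvement (E-step) and policy evaluation (M-step) can be sketched on a tabular toy problem. This is not the variational algorithm from the paper: the two-state MDP, the exact linear-solve evaluation, and the choice of using the mean squared Bellman residual as the softmax temperature are all assumptions made here to keep the example small and runnable.

```python
import numpy as np

GAMMA = 0.9
# Hypothetical deterministic MDP: action 0 stays, action 1 switches state.
NEXT = np.array([[0, 1], [1, 0]])            # NEXT[s, a] -> next state
REWARD = np.array([[0.0, 0.0], [1.0, 0.0]])  # reward 1 only for staying in state 1


def evaluate(policy):
    """M-step stand-in: exact policy evaluation, returns Q[s, a]."""
    n_states, n_actions = REWARD.shape
    transition = np.zeros((n_states, n_states))
    expected_reward = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            transition[s, NEXT[s, a]] += policy[s, a]
            expected_reward[s] += policy[s, a] * REWARD[s, a]
    # Solve v = r_pi + gamma * P_pi v for the state values.
    v = np.linalg.solve(np.eye(n_states) - GAMMA * transition, expected_reward)
    return REWARD + GAMMA * v[NEXT]


def bellman_residual(q):
    """Mean squared error between Q and its Bellman optimality target."""
    target = REWARD + GAMMA * q.max(axis=1)[NEXT]
    return float(np.mean((target - q) ** 2))


def improve(q, temperature):
    """E-step stand-in: Boltzmann policy over Q at the given temperature."""
    z = (q - q.max(axis=1, keepdims=True)) / temperature
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def run(iterations=50):
    policy = np.full(REWARD.shape, 0.5)  # start from a uniform policy
    for _ in range(iterations):
        q = evaluate(policy)
        # Assumed rule: the temperature shrinks with the residual, so the
        # policy becomes greedier as Q approaches a Bellman fixed point.
        temperature = max(bellman_residual(q), 1e-8)
        policy = improve(q, temperature)
    return q, policy


q, policy = run()
print("greedy actions per state:", policy.argmax(axis=1).tolist())
```

Keeping evaluation and improvement as separate functions mirrors the separate, iterative optimisation of value function and policy described above; in a function-approximation setting each step would be a gradient-based update rather than an exact solve.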

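We propose VIREL, a new reinforcement learning method based on probabilistic models. It uses a parametrised action-value function to summarise the future dynamics of the underlying MDP, which gives VIREL a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies separately. By applying variational expectation-maximisation to VIREL, we show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement corresponding to the E-step and policy evaluation to the M-step. Finally, we show that actor-critic algorithms from this family outperform state-of-the-art methods based on soft value functions in several domains.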