Reinforcement learning (RL) is a powerful technique for training intelligent
agents, but understanding why these agents make specific decisions can be quite
challenging. This lack of transparency in RL models has been a long-standing
problem, making it difficult for users to grasp the reasons behind an agent's
behaviour. Various approaches have been explored to address this problem, with
one promising avenue being reward decomposition (RD). RD is appealing as it
sidesteps some of the concerns associated with other methods that attempt to
rationalize an agent's behaviour in a post-hoc manner. RD works by exposing
various facets of the rewards that contribute to the agent's objectives during
training. However, RD alone has limitations as it primarily offers insights
based on sub-rewards and does not delve into the intricate cause-and-effect
relationships that occur within an RL agent's neural model. In this paper, we
present an extension of RD that goes beyond sub-rewards to provide more
informative explanations. Our approach is centred on a causal learning
framework that leverages information-theoretic measures for explanation
objectives that encourage three crucial properties of causal factors:
\emph{causal sufficiency}, \emph{sparseness}, and \emph{orthogonality}. These
properties help us distill the cause-and-effect relationships between the
agent's states and actions or rewards, allowing for a deeper understanding of
its decision-making processes. Our framework is designed to generate local
explanations and can be applied to a wide range of RL tasks with multiple
reward channels. Through a series of experiments, we demonstrate that our
approach offers more meaningful and insightful explanations for the agent's
action selections.
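
To make the reward decomposition idea above concrete, the following minimal sketch assumes an agent that keeps one Q-value per reward channel and per action. The channel names, action names, and numbers are hypothetical and are not taken from the paper; the sketch covers only plain RD, not the causal extension proposed here.

```python
import numpy as np

# Hypothetical decomposed Q-values for one state: one row per reward channel,
# one column per action. All names and numbers are illustrative assumptions.
REWARD_CHANNELS = ["progress", "safety", "energy"]
ACTIONS = ["left", "right", "wait"]
q_components = np.array([
    [1.0, 3.0, 0.0],    # progress
    [0.5, -2.0, 0.2],   # safety
    [-0.2, -0.3, 0.0],  # energy
])


def total_q(components):
    """Sum the per-channel Q-values to recover the overall Q-value per action."""
    return components.sum(axis=0)


def reward_difference(components, action_a, action_b):
    """Per-channel advantage of action_a over action_b."""
    return components[:, action_a] - components[:, action_b]


totals = total_q(q_components)
best = int(np.argmax(totals))  # index of the greedy action
diff = reward_difference(q_components, best, ACTIONS.index("right"))

print("chosen action:", ACTIONS[best])
for channel, delta in zip(REWARD_CHANNELS, diff):
    print(f"{channel}: {delta:+.1f} compared with 'right'")
```

In this toy state, "right" earns more on the progress channel, but the safety channel outweighs it, so the per-channel differences show why "left" is chosen.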

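This paper introduces a causal learning framework that extends reward decomposition. It uses information-theoretic explanation objectives to encourage three key properties of causal factors (causal sufficiency, sparseness, and orthogonality) and distills the causal relationships between the agent's states and its actions or rewards, giving a deeper understanding of its decision-making and more meaningful, insightful explanations for action selection.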

Causal State Distillation for Explainable Reinforcement Learning

Explainable reinforcement learning (XRL) methods aim to help elucidate agent
policies and decision-making processes. The majority of XRL approaches focus on
local explanations, seeking to shed light on the reasons an agent acts the way
it does at a specific world state. While such explanations are both useful and
necessary, they typically do not portray the outcomes of the agent's selected
action. In this work, we propose ``COViz'', a new local explanation
method that visually compares the outcome of an agent's chosen action to a
counterfactual one. In contrast to most local explanations that provide
state-limited observations of the agent's motivation, our method depicts
alternative trajectories the agent could have taken from the given state and
their outcomes. We evaluate the usefulness of COViz in supporting people's
understanding of agents' preferences and compare it with reward decomposition,
a local explanation method that describes an agent's expected utility for
different actions by decomposing it into meaningful reward types. Furthermore,
we examine the complementary benefits of integrating both methods. Our results
show that such integration significantly improved participants' performance.
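
The comparison between a chosen action and a counterfactual one can be illustrated with a toy rollout. This is a minimal sketch under stated assumptions: the corridor environment, the `policy` function, and the fixed horizon are all hypothetical, and the sketch computes only the two trajectories and their returns, not the visual presentation that COViz provides.

```python
def step(state, action):
    """Toy 1-D corridor with positions 0..4; reward 1.0 on first reaching 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 and state != 4 else 0.0
    return next_state, reward


def policy(state):
    """Hypothetical agent policy: always move right (+1)."""
    return 1


def rollout(state, first_action, horizon):
    """Take first_action, then follow the policy for the remaining steps."""
    trajectory, total = [state], 0.0
    action = first_action
    for _ in range(horizon):
        state, reward = step(state, action)
        trajectory.append(state)
        total += reward
        action = policy(state)
    return trajectory, total


start = 2
chosen_traj, chosen_return = rollout(start, policy(start), horizon=3)
cf_traj, cf_return = rollout(start, -1, horizon=3)  # counterfactual: move left

print("chosen:        ", chosen_traj, chosen_return)
print("counterfactual:", cf_traj, cf_return)
```

Placing the two trajectories side by side shows the outcome of each first action, rather than only the agent's preference at the starting state.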

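This paper proposes COViz, a local explanation method that visually compares the outcome of the agent's chosen action with that of a counterfactual action, and evaluates it against reward decomposition; the results show that combining the two methods significantly improved participants' performance.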

Explaining Reinforcement Learning Agents Through Counterfactual Action Outcomes

Explaining the behavior of intelligent agents such as robots to humans is
challenging due to their incomprehensible proprioceptive states, variational
intermediate goals, and resultant unpredictability. Moreover, one-step
explanations for reinforcement learning agents can be ambiguous as they fail to
account for the agent's future behavior at each transition, adding to the
complexity of explaining robot actions. By leveraging abstracted actions that
map to task-specific primitives, we avoid explanations on the movement level.
Our proposed framework combines reward decomposition (RD) with abstracted
action spaces into an explainable learning framework, allowing for
non-ambiguous and high-level explanations based on object properties in the
task. We demonstrate the effectiveness of our framework through quantitative
and qualitative analysis of two robot scenarios, showcasing visual and textual
explanations, derived from the output artifacts of RD, that are easy for humans
to comprehend. Additionally, we demonstrate the versatility of integrating
these artifacts with large language models for reasoning and interactive
querying.

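This paper proposes an explainable learning framework that combines abstracted actions with reward decomposition, making explanations of robot actions easier for humans to understand, and demonstrates the framework's effectiveness through quantitative and qualitative analysis of two scenarios.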

A Closer Look at Reward Decomposition for High-Level Robotic Explanations

Explaining the behavior of reinforcement learning agents operating in
sequential decision-making settings is challenging, as their behavior is
affected by a dynamic environment and delayed rewards. Methods that help users
understand the behavior of such agents can roughly be divided into local
explanations that analyze specific decisions of the agents and global
explanations that convey the general strategy of the agents. In this work, we
study a novel combination of local and global explanations for reinforcement
learning agents. Specifically, we combine reward decomposition, a local
explanation method that exposes which components of the reward function
influenced a specific decision, and HIGHLIGHTS, a global explanation method
that shows a summary of the agent's behavior in decisive states. We conducted
two user studies to evaluate the integration of these explanation methods and
their respective benefits. Our results show significant benefits for both
methods. In general, we found that the local reward decomposition was more
useful for identifying the agents' priorities. However, when there was only a
minor difference between the agents' preferences, the global information
provided by HIGHLIGHTS additionally improved participants' understanding.
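
The selection of decisive states for a global summary can be sketched as follows. The abstract does not define "decisive"; this sketch assumes a common importance heuristic, the gap between the best and worst action Q-values in a state, and the Q-table and function names are hypothetical.

```python
import numpy as np

# Hypothetical Q-table: one row per visited state, one column per action.
q_values = np.array([
    [1.0, 1.1, 0.9],   # state 0: the actions are nearly equivalent
    [5.0, -3.0, 0.0],  # state 1: the choice matters a lot
    [2.0, 2.0, 2.0],   # state 2: the choice does not matter
    [0.0, 4.0, 1.0],   # state 3: the choice matters
])


def state_importance(q):
    """Assumed heuristic: best-action value minus worst-action value."""
    return q.max(axis=1) - q.min(axis=1)


def top_k_states(q, k):
    """Indices of the k most important states, most important first."""
    importance = state_importance(q)
    order = np.argsort(-importance, kind="stable")
    return order[:k].tolist()


summary = top_k_states(q_values, k=2)
print("states to include in the summary:", summary)
```

A summary built this way shows the agent in states where a wrong action would be costly, and a local method such as reward decomposition can then be applied to each of those states.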

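This study explores combining local and global explanation methods, namely reward decomposition and HIGHLIGHTS, to help users understand the behavior and strategy of reinforcement learning agents when making decisions; two user studies show significant benefits for both methods.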