Explaining the behavior of reinforcement learning agents operating in
sequential decision-making settings is challenging, as their behavior is
affected by a dynamic environment and delayed rewards. Methods that help users
understand the behavior of such agents can roughly be divided into local
explanations that analyze specific decisions of the agents and global
explanations that convey the general strategy of the agents. In this work, we
study a novel combination of local and global explanations for reinforcement
learning agents. Specifically, we combine reward decomposition, a local
explanation method that exposes which components of the reward function
influenced a specific decision, and HIGHLIGHTS, a global explanation method
that shows a summary of the agent's behavior in decisive states. We conducted
two user studies to evaluate the integration of these explanation methods and
their respective benefits. Our results show significant benefits for both
methods. In general, we found that the local reward decomposition was more
useful for identifying the agents' priorities. However, when there was only a
minor difference between the agents' preferences, the global information
provided by HIGHLIGHTS additionally improved participants' understanding.
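
As a rough illustration of how the two explanation methods fit together, the sketch below uses made-up Q-values for three states. It assumes that the agent learns one Q-function per reward component and that the components sum to the overall Q-value, which is the usual form of reward decomposition. It also assumes that a state is "decisive" when the gap between its best and worst action values is large, a common importance measure for HIGHLIGHTS-style summaries that the abstract itself does not spell out. All names and numbers are hypothetical, and the real HIGHLIGHTS method shows short trajectories around the selected states rather than bare state identifiers.

```python
from typing import Dict, List, Tuple

# Decomposed Q-values for one state: reward component -> value per action.
Decomposed = Dict[str, List[float]]


def total_q(decomposed: Decomposed) -> List[float]:
    """Sum the component Q-values to get the overall Q-value of each action."""
    n_actions = len(next(iter(decomposed.values())))
    return [sum(q[a] for q in decomposed.values()) for a in range(n_actions)]


def explain_choice(decomposed: Decomposed) -> Tuple[int, Dict[str, float]]:
    """Local explanation: the greedy action and each component's share of it."""
    totals = total_q(decomposed)
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, {name: q[best] for name, q in decomposed.items()}


def state_importance(decomposed: Decomposed) -> float:
    """Assumed importance measure: gap between best and worst action value."""
    totals = total_q(decomposed)
    return max(totals) - min(totals)


def summarize(states: Dict[str, Decomposed], k: int) -> List[str]:
    """Global summary: the k states where the choice of action matters most."""
    ranked = sorted(states, key=lambda s: state_importance(states[s]), reverse=True)
    return ranked[:k]


# Hypothetical data: two reward components, three actions per state.
states = {
    "s0": {"progress": [1.0, 2.0, 0.5], "safety": [0.0, -0.5, 0.0]},
    "s1": {"progress": [1.0, 1.0, 1.0], "safety": [-5.0, 0.0, -1.0]},
    "s2": {"progress": [0.2, 0.3, 0.25], "safety": [0.0, 0.0, 0.0]},
}

for name in summarize(states, k=2):
    action, contributions = explain_choice(states[name])
    print(name, "-> action", action, contributions)
```

In this toy data, the summary selects the states with the largest value gaps, and the per-component breakdown for each selected state is what a reward decomposition display would present to the user.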

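This study explores combining local and global explanation methods, namely reward decomposition and HIGHLIGHTS, to help users understand the strategies behind the decisions of reinforcement learning agents, and reports significant benefits for both methods in two user studies.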

Integrating Policy Summaries with Reward Decomposition for Explaining Reinforcement Learning Agents

The finetuning of pretrained transformer-based language generation models is
typically conducted in an end-to-end manner, where the model learns to attend
to relevant parts of the input by itself. However, there does not exist a
mechanism to directly control the model's focus. This work aims to develop a
control mechanism by which a user can select spans of context as "highlights"
for the model to focus on, and generate relevant output. To achieve this goal,
we augment a pretrained model with trainable "focus vectors" that are directly
applied to the model's embeddings, while the model itself is kept fixed. These
vectors, trained on automatic annotations derived from attribution methods, act
as indicators for context importance. We test our approach on two core
generation tasks: dialogue response generation and abstractive summarization.
We also collect evaluation data where the highlight-generation pairs are
annotated by humans. Our experiments show that the trained focus vectors are
effective in steering the model to generate outputs that are relevant to
user-selected highlights.
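
To make the idea of focus vectors more concrete, here is a minimal sketch of one way such vectors could act on token embeddings. The abstract only says that the vectors are applied directly to the embeddings of a frozen model. The element-wise scale-and-shift form used here, the function name `apply_focus`, and all numbers are assumptions for illustration, and training the vectors on attribution-derived annotations is not shown.

```python
from typing import List

Vector = List[float]


def apply_focus(
    embeddings: List[Vector],
    highlighted: List[bool],
    scale: Vector,
    bias: Vector,
) -> List[Vector]:
    """Transform only the embeddings of highlighted tokens.

    `scale` and `bias` stand in for the trainable focus vectors (assumed
    form: element-wise scale and shift). The input embeddings themselves
    are left unmodified, mirroring the fixed pretrained model.
    """
    out = []
    for vec, is_highlighted in zip(embeddings, highlighted):
        if is_highlighted:
            out.append([s * x + b for x, s, b in zip(vec, scale, bias)])
        else:
            out.append(list(vec))  # non-highlighted tokens pass through
    return out


# Hypothetical two-token input with two-dimensional embeddings.
embeddings = [[1.0, 2.0], [3.0, 4.0]]
highlighted = [False, True]  # the user highlights only the second token
focused = apply_focus(embeddings, highlighted, scale=[2.0, 0.5], bias=[0.0, 1.0])
print(focused)
```

Only the highlighted token's embedding changes, so the same frozen model can be steered toward different spans simply by changing the `highlighted` mask.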

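This study aims to develop a control mechanism that lets a user select spans of the context as "highlights" so that the model generates relevant output. It uses trainable "focus vectors" to indicate context importance and tests their effectiveness on dialogue response generation and abstractive summarization.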