Inverse reinforcement learning (IRL) is the problem of inferring a reward
function from expert behavior. There are several approaches to IRL, but most
are designed to learn a Markovian reward. However, a reward function might be
non-Markovian, depending on more than just the current state, as is the case
for a reward machine (RM). Although there has been recent work on inferring
RMs, it assumes access to the reward signal, which is absent in IRL. We propose
a Bayesian IRL (BIRL)
framework for inferring RMs directly from expert behavior, requiring
significant changes to the standard framework. We define a new reward space,
adapt the expert demonstration to include history, show how to compute the
reward posterior, and propose a novel modification to simulated annealing to
maximize this posterior. We demonstrate that our method performs well when
optimizing according to its inferred reward and compares favorably to an
existing method that learns exclusively binary non-Markovian rewards.
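
To make the notion of a non-Markovian reward concrete, the following is a minimal Python sketch of a reward machine in its commonly used form: a finite set of states, transitions driven by labels, and a reward emitted on each transition. The class `RewardMachine`, the state names, and the coffee/office task are hypothetical illustrations, not the paper's implementation, and the sketch does not cover the Bayesian inference step.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Hypothetical minimal reward machine: rewards are attached to transitions."""
    initial_state: str
    # Maps (rm_state, label) -> (next_rm_state, reward).
    # Pairs that are not listed are treated as self-loops with reward 0.0.
    transitions: dict = field(default_factory=dict)

    def step(self, state, label):
        """Return (next_state, reward) for one observed label."""
        return self.transitions.get((state, label), (state, 0.0))

    def rewards_for(self, labels):
        """Replay a label sequence from the initial state and collect rewards."""
        state, rewards = self.initial_state, []
        for label in labels:
            state, reward = self.step(state, label)
            rewards.append(reward)
        return rewards


# Assumed toy task: reaching "office" pays off only after "coffee" was seen.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("u2", 1.0),
    },
)

with_coffee = rm.rewards_for(["coffee", "office"])
without_coffee = rm.rewards_for(["office"])
print(with_coffee, without_coffee)
```

The same final label, "office", yields a different reward in the two runs, so the reward cannot be written as a function of the current observation alone. This is the kind of history dependence that a Markovian reward cannot express.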

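The paper proposes a Bayesian inverse reinforcement learning (BIRL) framework that infers reward machines (RMs) directly from expert behavior. To handle non-Markovian reward functions, it makes significant changes to the standard framework: it defines a new reward space, adapts the expert demonstrations to include history, shows how to compute the reward posterior, and proposes a novel modification of simulated annealing to maximize that posterior. The method performs well when optimizing according to its inferred reward and compares favorably to an existing method that learns strictly binary non-Markovian rewards.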

Bayesian Inverse Reinforcement Learning for Non-Markovian Rewards

Many Reinforcement Learning algorithms assume a Markov reward function to
guarantee optimality. However, not all reward functions are known to be Markov.
In this paper, we propose a framework for mapping non-Markov reward functions
into equivalent Markov ones by learning a Reward Machine, a specialized reward
automaton. Unlike the general practice of learning Reward Machines, we do not
require a set of high-level propositional symbols from which to learn. Rather,
we learn hidden triggers that encode these symbols directly from data. We
demonstrate the importance of learning Reward Machines rather than their
Deterministic Finite-State Automata counterparts for this task, given that
Reward Machines can model reward dependencies in a single automaton. We
formalize this
distinction in our learning objective. Our mapping process is constructed as an
Integer Linear Programming problem. We prove that our mappings provide
consistent expectations for the underlying process. We empirically validate our
approach by learning black-box non-Markov reward functions in the Officeworld
domain. Additionally, we demonstrate the effectiveness of learning dependencies
between rewards in a new domain, Breakfastworld.
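
As an illustration of what mapping a non-Markov reward into a Markov one means, the sketch below checks whether the reward in a set of traces is a consistent function of a chosen key. With the observation alone as key, the toy traces conflict; once each step is annotated with an automaton state, the conflict disappears. The helpers `reward_conflicts` and `annotate`, the traces, and the automaton `delta` are all hypothetical. In particular, `delta` is written by hand here, whereas the paper learns the automaton through an Integer Linear Programming formulation that this sketch does not attempt to reproduce.

```python
def reward_conflicts(traces, key):
    """Return the set of keys that are observed with more than one reward."""
    seen = {}
    conflicts = set()
    for trace in traces:
        for step in trace:
            k = key(step)
            if k in seen and seen[k] != step["reward"]:
                conflicts.add(k)
            seen.setdefault(k, step["reward"])
    return conflicts


def annotate(trace, delta, initial_state="u0"):
    """Attach to each step the automaton state held before that step."""
    u, annotated = initial_state, []
    for step in trace:
        annotated.append({**step, "u": u})
        # Unlisted (state, observation) pairs leave the automaton state unchanged.
        u = delta.get((u, step["obs"]), u)
    return annotated


# Assumed toy data: "office" is rewarded only if "coffee" was observed earlier.
traces = [
    [{"obs": "coffee", "reward": 0.0}, {"obs": "office", "reward": 1.0}],
    [{"obs": "office", "reward": 0.0}],
]

# Hand-written automaton transition table (an assumption, not a learned one).
delta = {("u0", "coffee"): "u1"}

markov_conflicts = reward_conflicts(traces, key=lambda s: s["obs"])
augmented = [annotate(t, delta) for t in traces]
augmented_conflicts = reward_conflicts(augmented, key=lambda s: (s["u"], s["obs"]))
print(markov_conflicts, augmented_conflicts)
```

An empty conflict set for the augmented key means that, on these traces, the reward is a function of the pair (automaton state, observation), which is the sense in which the augmented process is Markov.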

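By learning a reward machine, the paper maps non-Markov reward functions into equivalent Markov ones. It demonstrates the importance of reward machines, compared with deterministic finite-state automata, for modeling reward dependencies in a single automaton, and validates the approach by learning black-box non-Markov reward functions in the Officeworld domain and dependencies between rewards in the Breakfastworld domain.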

Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov

We study the problem of reinforcement learning for a task encoded by a reward
machine. The task is defined over a set of properties in the environment,
called atomic propositions, and represented by Boolean variables. One
unrealistic assumption commonly used in the literature is that the truth values
of these propositions are accurately known. In real situations, however, these
truth values are uncertain since they come from sensors that suffer from
imperfections. At the same time, reward machines can be difficult to model
explicitly, especially when they encode complicated tasks. We develop a
reinforcement-learning algorithm that infers a reward machine that encodes the
underlying task while learning how to execute it, despite the uncertainties of
the propositions' truth values. To address such uncertainties, the algorithm
maintains a probabilistic estimate of the truth values of the atomic
propositions; it updates this estimate according to new sensory measurements
that arrive as the agent explores the environment. Additionally,
the algorithm maintains a hypothesis reward machine, which acts as an estimate
of the reward machine that encodes the task to be learned. As the agent
explores the environment, the algorithm updates the hypothesis reward machine
according to the obtained rewards and the estimate of the atomic propositions'
truth values. Finally, the algorithm uses a Q-learning procedure for the states
of the hypothesis reward machine to determine the policy that accomplishes the
task. We prove that the algorithm successfully infers the reward machine and
asymptotically learns a policy that accomplishes the respective task.
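
The abstract does not state how the probabilistic estimate is updated. One plausible reading, shown here only as a sketch, is a Bayes-rule update of the belief that a single atomic proposition is true, given a noisy binary sensor with known error rates. The function `update_belief`, the parameter names, and the numeric rates are assumptions made for illustration.

```python
def update_belief(prior, observed_true, true_positive_rate, false_positive_rate):
    """Bayes update of P(proposition is true) after one noisy binary reading.

    Assumes a known sensor model with both rates strictly between 0 and 1:
      true_positive_rate  = P(sensor reads True | proposition is true)
      false_positive_rate = P(sensor reads True | proposition is false)
    """
    if observed_true:
        likelihood_if_true = true_positive_rate
        likelihood_if_false = false_positive_rate
    else:
        likelihood_if_true = 1.0 - true_positive_rate
        likelihood_if_false = 1.0 - false_positive_rate
    numerator = likelihood_if_true * prior
    evidence = numerator + likelihood_if_false * (1.0 - prior)
    return numerator / evidence


# Assumed sensor: detects a true proposition 90% of the time, false alarms 20%.
belief_after_one = update_belief(0.5, True, 0.9, 0.2)
belief_after_two = update_belief(belief_after_one, True, 0.9, 0.2)
belief_after_miss = update_belief(0.5, False, 0.9, 0.2)
print(belief_after_one, belief_after_two, belief_after_miss)
```

Repeated consistent readings push the belief toward 1, while a contradicting reading pulls it back. An agent could then threshold or sample from this belief when deciding which transition of the hypothesis reward machine to follow.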

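The paper studies reinforcement learning for tasks encoded by a reward machine. It proposes an algorithm that combines probabilistic estimates with Q-learning and that successfully infers the reward machine and asymptotically learns a policy for the task, even when the truth values of the atomic propositions in the environment are uncertain.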