Regular decision processes (RDPs) are a subclass of non-Markovian decision
processes where the transition and reward functions are guarded by some regular
property of the past (a lookback). While RDPs enable intuitive and succinct
representation of non-Markovian decision processes, their expressive power
coincides with that of finite-state Markov decision processes (MDPs). We
introduce omega-regular decision processes (ODPs), where the non-Markovian
aspect of the transition and reward functions is extended to an omega-regular
lookahead over the system evolution. Semantically, these lookaheads can be considered as
promises made by the decision maker or the learning agent about her future
behavior. In particular, we assume that, if the promised lookaheads are not
met, then the payoff to the decision maker is $\bot$ (the least desirable payoff),
overriding any rewards collected by the decision maker. We enable optimization
and learning for ODPs under the discounted-reward objective by reducing them to
lexicographic optimization and learning over finite MDPs. We present
experimental results demonstrating the effectiveness of the proposed reduction.
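
The payoff semantics described above can be illustrated with a minimal sketch. It is not the reduction itself: it assumes a finite reward trace and a boolean flag saying whether the promised lookahead was met, whereas an omega-regular promise is a property of an infinite run. The names `discounted_return` and `payoff` are hypothetical. The payoff is encoded as a pair compared lexicographically, so that $\bot$ ranks below every discounted return.

```python
from typing import Sequence, Tuple


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def payoff(rewards: Sequence[float], gamma: float, promise_met: bool) -> Tuple[int, float]:
    """Return a pair (promise flag, discounted return).

    Assumption for illustration: a broken promise is encoded as (0, 0.0),
    which stands for the bottom payoff and discards all collected rewards.
    Python compares tuples lexicographically, so any run that keeps its
    promise ranks above any run that breaks it.
    """
    if not promise_met:
        return (0, 0.0)
    return (1, discounted_return(rewards, gamma))
```

Under this encoding, a run that keeps its promise and collects a negative return is still preferred to a run that collects a large reward but breaks its promise.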

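We introduce a new class of omega-regular decision processes (ODPs) and enable optimization and learning for them by reducing them to lexicographic optimization and learning over finite MDPs.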

Omega-Regular Decision Processes

Recently, regular decision processes have been proposed as a well-behaved form
of non-Markovian decision process. Regular decision processes are characterised
by a transition function and a reward function that depend on the whole history,
though regularly (as in regular languages). In practice, both the transition and
the reward functions can be seen as finite transducers. We study reinforcement
learning in regular decision processes. Our main contribution is to show that a
near-optimal policy can be PAC-learned in time polynomial in a set of
parameters that describe the underlying decision process. We argue that the
identified set of parameters is minimal and that it reasonably captures the
difficulty of a regular decision process.
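
To make the view of a reward function as a finite transducer concrete, here is a small hypothetical sketch. The key-and-door scenario, the state names, and the functions `step` and `rewards_for` are assumptions chosen for illustration and do not come from the paper. The reward for observing `door` depends on whether `key` appeared earlier in the history, and a two-state machine is enough to track that.

```python
from typing import Iterable, List, Tuple


def step(state: str, symbol: str) -> Tuple[str, float]:
    """One transducer step: return (next_state, reward).

    The reward is 1.0 only when "door" is observed after "key" has
    already appeared somewhere in the history.
    """
    reward = 1.0 if (state == "has_key" and symbol == "door") else 0.0
    next_state = "has_key" if (state == "has_key" or symbol == "key") else "no_key"
    return next_state, reward


def rewards_for(history: Iterable[str]) -> List[float]:
    """Run the transducer over a history and collect the emitted rewards."""
    state = "no_key"
    out: List[float] = []
    for symbol in history:
        state, r = step(state, symbol)
        out.append(r)
    return out
```

For example, `rewards_for(["door", "key", "door"])` yields a reward only at the last step. The reward is not a function of the current observation alone, but it is a function of the current observation together with the transducer state.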

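This paper studies reinforcement learning in regular decision processes and shows that a near-optimal policy can be PAC-learned in time polynomial in a set of parameters describing the process.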