Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL
-- have seen recent use as a way to express non-Markovian objectives in
reinforcement learning. We introduce a model-based probably approximately
correct (PAC) learning algorithm for omega-regular objectives in Markov
decision processes. Unlike prior approaches, our algorithm learns from sampled
trajectories of the system and does not require prior knowledge of the system's
topology.

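This paper introduces a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes. Unlike prior approaches, the algorithm learns from sampled trajectories of the system and does not require prior knowledge of the system's topology.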

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs
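
As a rough illustration of the model-based, trajectory-driven setting described in the abstract above, the following Python sketch estimates transition probabilities from sampled trajectories and computes a Hoeffding-style sample count. It is not the algorithm of the paper: the names `estimate_transitions` and `hoeffding_samples`, the encoding of a trajectory as (state, action, next state) triples, and the use of the Hoeffding bound are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def hoeffding_samples(epsilon, delta):
    """Samples needed so that an empirical frequency is within epsilon of
    the true probability with confidence at least 1 - delta
    (Hoeffding bound: n >= ln(2/delta) / (2 * epsilon^2))."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def estimate_transitions(trajectories):
    """Build an empirical transition model from sampled trajectories.

    Each trajectory is a list of (state, action, next_state) triples.
    Successor states are discovered from the samples themselves, so no
    prior knowledge of the transition graph is assumed.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            counts[(state, action)][next_state] += 1

    model = {}
    for state_action, successors in counts.items():
        total = sum(successors.values())
        model[state_action] = {
            succ: n / total for succ, n in successors.items()
        }
    return model


if __name__ == "__main__":
    samples = [
        [("s0", "a", "s1"), ("s1", "a", "s0")],
        [("s0", "a", "s0")],
    ]
    print(estimate_transitions(samples))
    print(hoeffding_samples(0.1, 0.05))
```

A learned model of this kind could then be handed to a standard model checker or planner; how the estimation error propagates to the probability of satisfying an omega-regular objective is exactly the part this sketch leaves out.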

When omega-regular objectives were first proposed in model-free reinforcement
learning (RL) for controlling MDPs, deterministic Rabin automata were used in
an attempt to provide a direct translation from their transitions to scalar
values. While these translations failed, it has turned out that it is possible
to repair them by using good-for-MDPs (GFM) Büchi automata instead. These are
nondeterministic Büchi automata with a restricted type of nondeterminism,
albeit not as restricted as in good-for-games automata. Indeed, deterministic
Rabin automata have a pretty straightforward translation to such GFM automata,
which is bi-linear in the number of states and pairs. Interestingly, the same
cannot be said for deterministic Streett automata: a translation to
nondeterministic Rabin or Büchi automata comes at an exponential cost, even
without requiring the target automaton to be good-for-MDPs. Do we have to pay
more than that to obtain a good-for-MDP automaton? The surprising answer is
that we have to pay significantly less when we instead expand the good-for-MDP
property to alternating automata: like the nondeterministic GFM automata
obtained from deterministic Rabin automata, the alternating good-for-MDP
automata we produce from deterministic Streett automata are bi-linear in the
size of the deterministic automaton and its index, and can therefore be
exponentially more succinct than minimal nondeterministic Büchi automata.

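This study finds that using good-for-MDPs Büchi automata in place of deterministic Rabin automata makes omega-regular objectives better suited to model-free reinforcement learning, and that the alternating good-for-MDPs automata obtained from deterministic Streett automata can be exponentially more succinct than minimal nondeterministic Büchi automata.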

Alternating Good-for-MDP Automata
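
To make the contrast between Rabin and Streett conditions in the abstract above concrete, the following Python sketch evaluates both acceptance conditions on the set of states a run visits infinitely often. It only illustrates the textbook definitions and their duality, not the translation to good-for-MDPs automata. The pair convention (each pair is `(e, f)`, where Rabin requires avoiding `e` and hitting `f`) and the function names are assumptions; other texts order or name the pair components differently.

```python
def rabin_accepts(inf_states, pairs):
    """Rabin: SOME pair (e, f) has no e-state and at least one f-state
    among the states visited infinitely often."""
    return any(
        not (inf_states & e) and bool(inf_states & f) for e, f in pairs
    )


def streett_accepts(inf_states, pairs):
    """Streett: for EVERY pair (e, f), visiting f infinitely often
    implies visiting e infinitely often."""
    return all(
        bool(inf_states & e) or not (inf_states & f) for e, f in pairs
    )


if __name__ == "__main__":
    pairs = [
        (frozenset({0}), frozenset({1})),
        (frozenset({2}), frozenset({0})),
    ]
    for inf in (frozenset({1}), frozenset({0, 1}), frozenset({0, 2})):
        print(sorted(inf), rabin_accepts(inf, pairs),
              streett_accepts(inf, pairs))
```

On the same list of pairs the two conditions are complements of each other. A Rabin condition is a disjunction over pairs, so a nondeterministic automaton can guess one pair to track, whereas a Streett condition is a conjunction over all pairs, which is one intuition for why alternation helps in the Streett case.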

We study observation-based strategies for partially-observable Markov
decision processes (POMDPs) with omega-regular objectives. An observation-based
strategy relies on partial information about the history of a play, namely, on
the past sequence of observations. We consider the qualitative analysis
problem: given a POMDP with an omega-regular objective, decide whether there is an
observation-based strategy to achieve the objective with probability 1
(almost-sure winning), or with positive probability (positive winning). Our
main results are twofold. First, we present a complete picture of the
computational complexity of the qualitative analysis of POMDPs with parity
objectives (a canonical form to express omega-regular objectives) and their
subclasses. Our contribution consists in establishing several upper and lower
bounds that were not known in the literature. Second, we present optimal bounds
(matching upper and lower bounds) on the memory required by pure and randomized
observation-based strategies for the qualitative analysis of POMDPs with
parity objectives and their subclasses.

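This paper studies observation-based strategies for partially observable Markov decision processes (POMDPs) with omega-regular objectives, and settles the computational complexity of the qualitative analysis problem together with optimal memory bounds.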

Qualitative Analysis of Partially-observable Markov Decision Processes
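
The abstract above says an observation-based strategy can depend only on the past sequence of observations. One common way to make this concrete is to track the belief support, that is, the set of states consistent with the history so far. The Python sketch below shows such an update step. It is a generic construction offered as an assumption, not the procedure of the paper. The names `belief_support_update` and `track`, and the encoding of the POMDP as a successor map `trans` and an observation map `obs`, are hypothetical.

```python
def belief_support_update(support, action, observation, trans, obs):
    """Return the states that are possible after playing `action` from
    some state in `support` and then seeing `observation`.

    trans: dict mapping (state, action) to the set of successors that
           have positive probability.
    obs:   dict mapping each state to its observation.
    """
    successors = set()
    for state in support:
        successors |= trans.get((state, action), set())
    return frozenset(s for s in successors if obs[s] == observation)


def track(initial_support, history, trans, obs):
    """Fold a history of (action, observation) pairs into a belief support."""
    support = frozenset(initial_support)
    for action, observation in history:
        support = belief_support_update(
            support, action, observation, trans, obs
        )
    return support


if __name__ == "__main__":
    trans = {
        ("s0", "a"): {"s1", "s2"},
        ("s1", "a"): {"s1"},
        ("s2", "a"): {"s0"},
    }
    obs = {"s0": "x", "s1": "y", "s2": "y"}
    print(sorted(track({"s0"}, [("a", "y"), ("a", "y")], trans, obs)))
```

Only the supports of the transition distributions appear here, not the probabilities themselves, which fits the qualitative questions (probability 1 or positive probability) posed in the abstract.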

We study observation-based strategies for two-player turn-based games on
graphs with omega-regular objectives. An observation-based strategy relies on
imperfect information about the history of a play, namely, on the past sequence
of observations. Such games occur in the synthesis of a controller that does
not see the private state of the plant. Our main results are twofold. First, we
give a fixed-point algorithm for computing the set of states from which a
player can win with a deterministic observation-based strategy for any
omega-regular objective. The fixed point is computed in the lattice of
antichains of state sets. This algorithm has the advantages of being directed
by the objective and of avoiding an explicit subset construction on the game
graph. Second, we give an algorithm for computing the set of states from which
a player can win with probability 1 with a randomized observation-based
strategy for a Büchi objective. This set is of interest because in the absence
of perfect information, randomized strategies are more powerful than
deterministic ones. We show that our algorithms are optimal by proving matching
lower bounds.

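This paper studies observation-based strategies in two-player turn-based games on graphs with omega-regular objectives. It gives fixed-point algorithms for computing the sets of states from which a player can win with a deterministic observation-based strategy, and with probability 1 using a randomized observation-based strategy for Büchi objectives.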
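
The abstract above computes its fixed point in the lattice of antichains of state sets. As a small, hedged illustration of that data structure, the Python sketch below keeps only the inclusion-maximal sets of a family and defines a join and a meet on such antichains. Whether the paper keeps maximal or minimal elements, and how its predecessor operator is defined, is not stated in the abstract. The choice of maximal elements and the names `maximal_elements`, `antichain_join`, and `antichain_meet` are assumptions.

```python
def maximal_elements(family):
    """Reduce a family of sets to its antichain of inclusion-maximal sets."""
    sets = {frozenset(s) for s in family}
    # `s < t` is strict inclusion on frozensets.
    return {s for s in sets if not any(s < t for t in sets)}


def antichain_join(a, b):
    """Least upper bound: maximal sets of the union of both families."""
    return maximal_elements(list(a) + list(b))


def antichain_meet(a, b):
    """Greatest lower bound: maximal sets among pairwise intersections."""
    return maximal_elements(
        [frozenset(s) & frozenset(t) for s in a for t in b]
    )


if __name__ == "__main__":
    print(maximal_elements([{1}, {1, 2}, {3}]))
    print(antichain_join([{1, 2}], [{2}, {3}]))
    print(antichain_meet([{1, 2}, {3}], [{2, 3}]))
```

Storing only the maximal sets is what lets this kind of algorithm avoid enumerating every subset of states, in line with the abstract's remark about avoiding an explicit subset construction.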