A recent method for solving zero-sum partially observable stochastic games
(zs-POSGs) embeds the original game into a new one called the occupancy Markov
game. This reformulation allows applying Bellman's principle of optimality to
solve zs-POSGs. However, improving a current solution requires solving a linear
program with exponentially many potential constraints, which significantly
restricts the scalability of this approach. This paper exploits the optimal
value function's novel uniform continuity properties to overcome this
limitation. We first construct a new operator that is computationally more
efficient than the state-of-the-art update rules without compromising
optimality. In particular, improving a current solution now involves a linear
program with exponentially fewer constraints. We also show that point-based
value iteration algorithms that use these findings improve the scalability of
existing methods while maintaining guarantees in various domains.
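
The abstract does not say where the exponentially many constraints come from.
One common source in this setting, assumed here purely for illustration, is
that a max-min linear program carries one constraint per deterministic decision
rule of the opponent, where a decision rule maps each private history to an
action. The sketch below only counts such rules; the function names and the
history/action labels are hypothetical and are not taken from the paper.

```python
from itertools import product


def deterministic_decision_rules(histories, actions):
    """Enumerate every mapping from a private history to an action.

    Illustrative assumption: each returned rule would contribute one
    constraint to a max-min linear program.
    """
    return [
        dict(zip(histories, choice))
        for choice in product(actions, repeat=len(histories))
    ]


def count_decision_rules(num_histories, num_actions):
    """Closed-form count: num_actions ** num_histories."""
    return num_actions ** num_histories


if __name__ == "__main__":
    rules = deterministic_decision_rules(["h0", "h1", "h2"], ["a", "b"])
    print(len(rules), count_decision_rules(3, 2))
    # The count grows exponentially with the number of histories.
    for n in (2, 4, 8, 16):
        print(n, count_decision_rules(n, 3))
```

Under this assumption, the count is exponential in the number of private
histories, and the number of histories itself typically grows with the horizon.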

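This paper presents a method for solving zero-sum partially observable
stochastic games: embedding the original game in a new game, called the
occupancy Markov game, makes it possible to apply Bellman's principle of
optimality. The method improves scalability by exploiting the uniform
continuity of the value function, and it proposes an operator that is more
efficient than existing update rules and reduces the number of constraints in
the linear program. It also shows that point-based value iteration algorithms
built on these findings improve the scalability of existing methods in various
domains while preserving their guarantees.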

$ε$-Optimally Solving Zero-Sum POSGs

In many problems, it is desirable to optimize an objective function while
imposing constraints on some other aspect of the problem. A Constrained
Partially Observable Markov Decision Process (C-POMDP) allows modelling of such
problems while subject to transition uncertainty and partial observability.
Typically, the constraints in C-POMDPs enforce a threshold on expected
cumulative costs starting from an initial state distribution. In this work, we
first show that optimal C-POMDP policies may violate Bellman's principle of
optimality and thus may exhibit pathological behaviors, which can be
undesirable for many applications. To address this drawback, we introduce a new
formulation, the Recursively-Constrained POMDP (RC-POMDP), that imposes
additional history-dependent cost constraints on the C-POMDP. We show that,
unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies, and that
optimal policies obey Bellman's principle of optimality. We also present a
point-based dynamic programming algorithm that synthesizes optimal policies for
RC-POMDPs. In our evaluations, we show that policies for RC-POMDPs produce more
desirable behavior than policies for C-POMDPs and demonstrate the efficacy of
our algorithm across a set of benchmark problems.
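
The C-POMDP constraint described above can be written as a standard
constrained optimization problem. The notation here is an assumption, since
the abstract defines none: $b_0$ is the initial state distribution, $\gamma$ a
discount factor, $r$ the reward, $c_k$ the $k$-th cost function, and
$\hat{c}_k$ its threshold. The abstract does not say whether the paper uses a
discounted or a finite-horizon criterion.

```latex
\begin{aligned}
\max_{\pi} \quad
  & \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, b_0 \right] \\
\text{s.t.} \quad
  & \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c_k(s_t, a_t) \,\middle|\, b_0 \right]
    \le \hat{c}_k, \qquad k = 1, \dots, K .
\end{aligned}
```

In this form the constraint is imposed only in expectation from $b_0$. The
RC-POMDP described in the abstract adds history-dependent cost constraints on
top of it; their exact form is not given here.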

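By introducing the Recursively-Constrained POMDP (RC-POMDP), which adds new
history-dependent cost constraints, this paper addresses problems with the
conventional Constrained POMDP (C-POMDP) and proposes a point-based dynamic
programming algorithm for finding optimal RC-POMDP policies. Experiments show
that RC-POMDP policies behave better than C-POMDP policies and demonstrate the
algorithm's effectiveness on a set of benchmark problems.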