In general-sum stochastic games, a stationary Stackelberg equilibrium (SSE), in which the leader maximizes its return from every initial state while the follower plays the best response to the leader's policy, does not always exist. Existing methods for finding SSEs require strong assumptions to guarantee convergence and to guarantee that the limit coincides with an SSE. Moreover, our analysis suggests that the performance at the fixed points of these methods is not reasonable when those fixed points are not SSEs. Herein, we introduce the concept of Pareto-optimality as a reasonable alternative to SSEs. We derive a policy improvement theorem for stochastic games with a best-response follower and, based on it, propose an iterative algorithm for finding Pareto-optimal policies. We prove monotone improvement and convergence of the proposed approach, and we prove its convergence to SSEs in a special case.
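
To make the contrast between the two solution concepts concrete, the following is a minimal formal sketch. All notation is an assumption introduced here for illustration and is not taken from the source: $f$ denotes a stationary leader policy, $\mathrm{BR}(f)$ the follower's best response to $f$ (assumed well defined, for example under a fixed tie-breaking rule), $V_L^{f,g}(s)$ the leader's return from initial state $s$, and $S$ the state space. Pareto-optimality is read here as non-dominance across initial states, which is one natural interpretation of the abstract rather than a quoted definition.

```latex
% Hypothetical notation (not from the source):
%   f, f'      : stationary leader policies
%   BR(f)      : follower's best response to f (assumed well defined)
%   V_L^{f,g}  : leader's return under the policy pair (f, g)
%   S          : state space

% SSE: one leader policy is simultaneously best from every initial state.
\[
  f^{*} \text{ is an SSE}
  \iff
  V_L^{f^{*},\,\mathrm{BR}(f^{*})}(s) \;\ge\; V_L^{f,\,\mathrm{BR}(f)}(s)
  \quad \text{for all } s \in S \text{ and all } f .
\]

% Pareto-optimality: no other leader policy is at least as good everywhere
% and strictly better from some initial state.
\[
  f \text{ is Pareto-optimal}
  \iff
  \not\exists\, f' :\;
  \Big( V_L^{f',\,\mathrm{BR}(f')}(s) \ge V_L^{f,\,\mathrm{BR}(f)}(s) \ \ \forall s \in S \Big)
  \ \text{and}\
  \Big( V_L^{f',\,\mathrm{BR}(f')}(s_0) > V_L^{f,\,\mathrm{BR}(f)}(s_0) \ \text{for some } s_0 \in S \Big).
\]
```

Under these assumed definitions, every SSE is Pareto-optimal, since a policy that is best from every initial state cannot be strictly improved from any of them. The converse need not hold, which is why Pareto-optimality remains meaningful as a target even when no SSE exists.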

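For general-sum stochastic games, Pareto-optimality is introduced as an alternative to the equilibrium. A policy improvement theorem is derived for stochastic games with a best-response follower, and an iterative algorithm is proposed for finding Pareto-optimal policies. Monotone improvement and convergence of the method are proved, along with convergence to the equilibrium in a special case.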

Policy Iteration for Pareto-Optimal Policies in Stochastic Stackelberg Games