Reinforcement learning (RL) has recently emerged as a powerful tool for solving
complex problems in the domain of board games, where an agent must learn
complex strategies and moves from its own experience and the rewards it
receives. While RL has outperformed existing state-of-the-art methods for
playing simple video games and popular board games, it has yet to demonstrate
its capability on ancient games. Here, we solve one such problem: we train our
agents using three methods, namely Monte Carlo, Q-learning, and Expected Sarsa,
to learn an optimal policy for playing the strategic Royal Game of Ur. The
state space of the game is large and complex, but our agents show promising
results at playing the game and learning important strategic moves. Although
it is hard to conclude which algorithm performs better overall when trained
with limited resources, Expected Sarsa shows promising results in terms of
learning speed.
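To make the difference between two of the methods named above concrete, here is a minimal tabular sketch of the Q-learning and Expected Sarsa update targets. It is an illustration under assumptions, not the authors' implementation: the function names, the epsilon-greedy behaviour policy, and the parameters `gamma`, `alpha`, and `epsilon` are all hypothetical choices that the abstract does not specify.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values (assumed policy)."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def q_learning_target(reward, next_q, gamma):
    """Q-learning bootstraps from the greedy (maximum) next-state value."""
    return reward + gamma * max(next_q)


def expected_sarsa_target(reward, next_q, gamma, epsilon):
    """Expected Sarsa bootstraps from the policy-weighted average next-state value."""
    probs = epsilon_greedy_probs(next_q, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, next_q))


def td_update(q_sa, target, alpha):
    """Move the current estimate a fraction alpha toward the target."""
    return q_sa + alpha * (target - q_sa)


# Example: two actions available in the next state.
next_q = [1.0, 3.0]
print(q_learning_target(0.0, next_q, gamma=1.0))
print(expected_sarsa_target(0.0, next_q, gamma=1.0, epsilon=0.5))
```

With `epsilon` set to zero the two targets coincide, since the policy-weighted average then puts all its mass on the greedy action.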

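This study trains agents with Monte Carlo, Q-learning, and Expected Sarsa to learn an optimal policy for the ancient strategy game the Royal Game of Ur. The agents show promising results and learning ability, and Expected Sarsa stands out in learning speed.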

Solving Royal Game of Ur Using Reinforcement Learning

We study the convergence of $\mathtt{Expected~Sarsa}(\lambda)$ with linear
function approximation. We show that applying the off-line estimate (multi-step
bootstrapping) to $\mathtt{Expected~Sarsa}(\lambda)$ is unstable for off-policy
learning. Furthermore, based on a convex-concave saddle-point framework, we
propose a convergent $\mathtt{Gradient~Expected~Sarsa}(\lambda)$
($\mathtt{GES}(\lambda)$) algorithm. The theoretical analysis shows that
$\mathtt{GES}(\lambda)$ converges to the optimal solution at a linear
convergence rate, which is comparable to that of many existing state-of-the-art
gradient temporal difference (GTD) learning algorithms. We also develop a
Lyapunov function technique to investigate how the step size influences the
finite-time performance of $\mathtt{GES}(\lambda)$; this technique can
potentially be generalized to other GTD algorithms. Finally, we
conduct experiments to verify the effectiveness of our $\mathtt{GES}(\lambda)$.
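
For orientation, the one-step Expected Sarsa temporal-difference error under linear function approximation can be written as below. The notation ($\theta$ for the weight vector, $\phi$ for the feature map, $\pi$ for the target policy, $\gamma$ for the discount factor) is assumed here and is not taken from the abstract; the $\lambda$-weighted multi-step form and the gradient correction used by $\mathtt{GES}(\lambda)$ are not reproduced.

```latex
\hat{Q}_\theta(s, a) = \theta^\top \phi(s, a), \qquad
\delta_t = r_{t+1}
  + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, \theta^\top \phi(s_{t+1}, a')
  - \theta^\top \phi(s_t, a_t)
```

The sum over $a'$ is what distinguishes Expected Sarsa from Sarsa, which would instead use only the single sampled next action.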

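This study addresses the convergence of Expected Sarsa with linear function approximation. It proposes the convergent Gradient Expected Sarsa algorithm, analyzes its performance with a Lyapunov function technique, and reports favorable experimental results.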

On Convergence of Gradient Expected Sarsa($λ$)

We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by Expected Sarsa, EPG integrates (or sums) across actions
when estimating the gradient, instead of relying only on the action in the
sampled trajectory. For continuous action spaces, we first derive a practical
result for Gaussian policies and quadratic critics and then extend it to a
universal analytical method, covering a broad class of actors and critics,
including Gaussian, exponential families, and policies with bounded support.
For Gaussian policies, we introduce an exploration method that uses covariance
proportional to the matrix exponential of the scaled Hessian of the critic with
respect to the actions. For discrete action spaces, we derive a variant of EPG
based on softmax policies. We also establish a new general policy gradient
theorem, of which the stochastic and deterministic policy gradient theorems are
special cases. Furthermore, we prove that EPG reduces the variance of the
gradient estimates without requiring deterministic policies and with little
computational overhead. Finally, we provide an extensive experimental
evaluation of EPG and show that it outperforms existing approaches on multiple
challenging control domains.
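
The idea of summing across actions, rather than using only the sampled action, can be sketched for a discrete softmax policy as follows. This is a hedged illustration and not the paper's algorithm: it assumes the policy parameters are the logits of a single state, treats the critic values `q_values` as given, and uses hypothetical function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sampled_gradient(logits, q_values, action):
    """Score-function estimate from one sampled action:
    Q(s, a) * d log pi(a|s) / d logits."""
    probs = softmax(logits)
    return [
        q_values[action] * ((1.0 if k == action else 0.0) - probs[k])
        for k in range(len(logits))
    ]


def expected_gradient(logits, q_values):
    """Sum over all actions: d / d logits of sum_a pi(a|s) * Q(s, a)."""
    probs = softmax(logits)
    mean_q = sum(p * q for p, q in zip(probs, q_values))
    return [probs[k] * (q_values[k] - mean_q) for k in range(len(logits))]


logits = [0.0, 0.0]
q_values = [1.0, 3.0]
print(expected_gradient(logits, q_values))
print(sampled_gradient(logits, q_values, action=0))
print(sampled_gradient(logits, q_values, action=1))
```

Averaging the sampled estimates under the policy's action probabilities recovers the summed gradient, so in this sketch the summed form removes the variance that comes from sampling the action.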

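The paper proposes expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning in continuous or discrete action spaces. Experiments show that EPG outperforms existing methods on multiple control tasks.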