We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect
information games (IIGs) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by $T$. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments.
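
As a rough illustration of the fixed sampling idea, the sketch below runs entropic OMD (exponential weights) at a single information set, with a decreasing learning rate and loss estimates that are importance-weighted by a fixed sampling policy rather than by the current policy. It is a minimal toy under stated assumptions, not the algorithm analyzed in this work: the names `omd_step` and `run_local_omd`, the $1/\sqrt{t}$ learning-rate schedule, the `base_rate` value, and the restriction to one information set with fixed losses are all hypothetical choices made for illustration.

```python
import math
import random


def omd_step(policy, loss_estimate, learning_rate):
    """One entropic OMD (exponential-weights) step on the probability simplex."""
    weights = [p * math.exp(-learning_rate * l)
               for p, l in zip(policy, loss_estimate)]
    total = sum(weights)
    return [w / total for w in weights]


def run_local_omd(true_losses, sampling_policy, episodes, base_rate=0.5, seed=0):
    """Toy loop for a single information set (illustrative assumption).

    Actions are always drawn from the fixed `sampling_policy`, so the
    importance weights 1 / sampling_policy[a] do not change as the
    learned policy changes.
    """
    rng = random.Random(seed)
    n = len(true_losses)
    policy = [1.0 / n] * n  # start from the uniform policy
    for t in range(1, episodes + 1):
        # Observe one action sampled from the fixed policy, not the learned one.
        action = rng.choices(range(n), weights=sampling_policy)[0]
        # Unbiased importance-weighted loss estimate under the fixed policy.
        estimate = [0.0] * n
        estimate[action] = true_losses[action] / sampling_policy[action]
        # Decreasing learning rate (assumed schedule: base_rate / sqrt(t)).
        policy = omd_step(policy, estimate, base_rate / math.sqrt(t))
    return policy


if __name__ == "__main__":
    final = run_local_omd([0.9, 0.1, 0.5], [1 / 3, 1 / 3, 1 / 3], episodes=2000)
    print(final)
```

Because the sampling policy is fixed, the loss estimates stay bounded by `max(true_losses) / min(sampling_policy)` throughout the run, whereas weighting by the learned policy would let them grow as that policy concentrates on a few actions.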

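We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games with trajectory feedback. By applying an adaptive Online Mirror Descent (OMD) algorithm that uses decreasing learning rates and a regularized loss at each information set, we show that the approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and, under the theoretically best choices of learning rates and sampling policy, has a near-optimal dependence on the game parameters. To obtain these results, we extend the notion of OMD stabilization to allow time-varying regularization with convex increments.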