A wide array of modern machine learning applications - from adversarial
models to multi-agent reinforcement learning - can be formulated as
non-cooperative games whose Nash equilibria represent the system's desired
operational states. Despite having a highly non-convex loss landscape, many
cases of interest possess a latent convex structure that could potentially be
leveraged to yield convergence to equilibrium. Driven by this observation, our
paper proposes a flexible first-order method that successfully exploits such
"hidden structures" and achieves convergence under minimal assumptions for the
transformation connecting the players' control variables to the game's latent,
convex-structured layer. The proposed method - which we call preconditioned
hidden gradient descent (PHGD) - hinges on a judiciously chosen gradient
preconditioning scheme related to natural gradient methods. Importantly, we
make no separability assumptions for the game's hidden structure, and we
provide explicit convergence rate guarantees for both deterministic and
stochastic environments.
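
As a rough illustration of the idea, the sketch below runs a preconditioned gradient scheme on a toy two-player min-max game with one-dimensional controls. Everything specific here is an assumption made for illustration and is not taken from the abstract: the latent map x_i = sigmoid(theta_i), the hidden convex-concave objective, the equilibrium (0.3, 0.7), the step size, and the preconditioner, which is taken to be the pseudo-inverse of J^T J for the Jacobian J of the latent map (a Gauss-Newton-like choice in the spirit of natural gradient methods).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def latent_gradients(x1, x2):
    """Partial derivatives of the hypothetical hidden objective
    f(x1, x2) = a^2 + a*b - b^2 with a = x1 - 0.3, b = x2 - 0.7.
    f is convex in x1 (minimizer) and concave in x2 (maximizer)."""
    a, b = x1 - 0.3, x2 - 0.7
    return 2.0 * a + b, a - 2.0 * b


def phgd_sketch(theta1, theta2, step=0.05, iters=2000):
    """Preconditioned gradient descent-ascent on the control variables.
    Assumption: the preconditioner is (J^T J)^+, which is 1 / J^2 for a
    scalar latent map with nonzero derivative J."""
    for _ in range(iters):
        x1, x2 = sigmoid(theta1), sigmoid(theta2)
        g1, g2 = latent_gradients(x1, x2)
        # Derivative of the sigmoid latent map at the current controls.
        j1, j2 = x1 * (1.0 - x1), x2 * (1.0 - x2)
        # Chain rule: gradient with respect to the control variables.
        v1, v2 = j1 * g1, j2 * g2
        # Scalar pseudo-inverse of J^T J (zero if the derivative vanishes).
        p1 = 1.0 / (j1 * j1) if j1 > 0.0 else 0.0
        p2 = 1.0 / (j2 * j2) if j2 > 0.0 else 0.0
        theta1 -= step * p1 * v1  # player 1 descends
        theta2 += step * p2 * v2  # player 2 ascends
    return sigmoid(theta1), sigmoid(theta2)


x1, x2 = phgd_sketch(2.0, -2.0)
print(x1, x2)
```

With this preconditioner, a step in theta moves the latent point x by approximately the plain gradient step in the latent space, which is why the hidden convex-concave structure can drive convergence even though the objective is non-convex in theta.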

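This work proposes a flexible first-order method, preconditioned hidden gradient descent (PHGD), that exploits hidden convex structure in machine learning problems to achieve convergence to equilibrium. It provides explicit rate guarantees for convergence to Nash equilibrium in non-cooperative games where a transformation connects the players' control variables to the latent convex structure.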

Exploiting hidden structures in non-convex games for convergence to Nash equilibrium

This paper examines the long-run behavior of learning with bandit feedback in
non-cooperative concave games. The bandit framework accounts for extremely
low-information environments where the agents may not even know they are
playing a game; as such, the agents' most sensible choice in this setting would
be to employ a no-regret learning algorithm. In general, this does not mean
that the players' behavior stabilizes in the long run: no-regret learning may
lead to cycles, even with perfect gradient information. However, if a standard
monotonicity condition is satisfied, our analysis shows that no-regret learning
based on mirror descent with bandit feedback converges to Nash equilibrium with
probability $1$. We also derive an upper bound for the convergence rate of the
process that nearly matches the best attainable rate for single-agent bandit
stochastic optimization.
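
To make the setting concrete, the sketch below simulates two players who observe only their own realized payoffs. Each player perturbs its action, forms a single-point gradient estimate from the payoff it receives, and takes a mirror descent step with a Euclidean regularizer, in which case the step reduces to a projected gradient step. The payoff functions, the equilibrium (0.3, -0.4), the action interval [-1, 1], and the step-size and query-radius schedules are all hypothetical choices for illustration, and the feasibility handling is a simplification rather than the paper's construction.

```python
import random


def payoffs(x1, x2):
    """Hypothetical concave game with a strongly monotone gradient field.
    Its unique Nash equilibrium is (0.3, -0.4)."""
    a, b = x1 - 0.3, x2 + 0.4
    u1 = -a * a - 0.5 * a * b
    u2 = -b * b + 0.5 * a * b
    return u1, u2


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def bandit_mirror_descent(iters=20000, seed=0):
    rng = random.Random(seed)
    x = [0.0, 0.0]  # pivot actions, one per player
    for n in range(iters):
        step = 1.0 / (n + 10)                     # assumed step-size schedule
        delta = 0.5 / (n + 10) ** (1.0 / 3.0)     # assumed query radius
        # Keep pivots far enough from the boundary that queries stay feasible.
        x = [clip(xi, -1.0 + delta, 1.0 - delta) for xi in x]
        # In one dimension the unit sphere is {-1, +1}.
        z = [rng.choice((-1.0, 1.0)) for _ in x]
        played = [xi + delta * zi for xi, zi in zip(x, z)]
        u = payoffs(*played)                      # the only feedback observed
        # Single-point gradient estimate (dimension factor is 1 here).
        v_hat = [ui * zi / delta for ui, zi in zip(u, z)]
        # Euclidean mirror step: projected gradient ascent on own payoff.
        x = [clip(xi + step * vi, -1.0, 1.0) for xi, vi in zip(x, v_hat)]
    return x


x1, x2 = bandit_mirror_descent()
print(x1, x2)
```

Neither player evaluates a gradient or sees the other's action; the random perturbation is what turns a single payoff observation into a usable gradient estimate, at the cost of a variance that grows as the query radius shrinks.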

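This work studies the long-run behavior of learning with bandit feedback in non-cooperative concave games. It proves that no-regret learning based on mirror descent converges to Nash equilibrium with probability $1$ when a standard monotonicity condition holds, and it derives an upper bound on the convergence rate.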