Gradient-based learning in multi-agent systems is difficult because the
gradient derives from a first-order model which does not account for the
interaction between agents' learning processes. LOLA (arXiv:1709.04326)
accounts for this by differentiating through one step of optimization. We
extend the ideas of LOLA and develop a fully general value-based approach to
optimization. At the core is a function we call the meta-value, which at each
point in joint-policy space gives for each agent a discounted sum of its
objective over future optimization steps. We argue that the gradient of the
meta-value gives a more reliable improvement direction than the gradient of the
original objective, because the meta-value derives from empirical observations
of the effects of optimization. We show how the meta-value can be approximated
by training a neural network to minimize TD error along optimization
trajectories in which agents follow the gradient of the meta-value. We analyze
the behavior of our method on the Logistic Game and on the Iterated Prisoner's
Dilemma.
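
The meta-value and its TD error can be sketched in equations. The notation is assumed here and does not appear in the abstract: x is the joint policy, f_i is agent i's objective (taken to be maximized), \alpha is a step size, \gamma \in [0, 1) is the discount, \hat V is the neural-network approximation, and \bar\nabla V stacks each agent's gradient of its own meta-value with respect to its own parameters. Whether the sum starts at the current or the next iterate is also an assumption.

```latex
\begin{align}
  x_{t+1} &= x_t + \alpha\,\bar\nabla V(x_t), \qquad
  \bar\nabla V(x) = \bigl(\nabla_{x_1} V_1(x), \dots, \nabla_{x_n} V_n(x)\bigr), \\
  V_i(x_0) &= \sum_{t=0}^{\infty} \gamma^{t} f_i(x_t)
            = f_i(x_0) + \gamma\, V_i(x_1), \\
  \delta_i(x_t) &= f_i(x_t) + \gamma\, \hat V_i(x_{t+1}) - \hat V_i(x_t).
\end{align}
```

Under these assumptions, training drives the TD error \delta_i toward zero along the optimization trajectory x_0, x_1, \dots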

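Gradient-based learning in multi-agent systems is difficult. LOLA addresses this by differentiating through one step of optimization. We extend the ideas of LOLA and develop a fully general value-based approach to optimization. At its core is a function called the meta-value, which at each point in joint-policy space gives each agent the discounted sum of its objective over future optimization steps. We approximate the meta-value by training a neural network to minimize TD error along optimization trajectories.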

Meta-Value Learning: a General Framework for Learning with Learning Awareness

Learning in general-sum games is unstable and frequently leads to socially
undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with
Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting,
by accounting for each agent's influence on their opponents' anticipated
learning steps. However, the original LOLA formulation (and follow-up work) is
inconsistent because LOLA models other agents as naive learners rather than
LOLA agents. In previous work, this inconsistency was suggested as a cause of
LOLA's failure to preserve stable fixed points (SFPs). First, we formalize
consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency
problem if it converges. Second, we correct a claim made in the literature by
Schäfer and Anandkumar (2019), proving that Competitive Gradient Descent
(CGD) does not recover HOLA as a series expansion (and fails to solve the
consistency problem). Third, we propose a new method called Consistent LOLA
(COLA), which learns update functions that are consistent under mutual opponent
shaping. It requires no more than second-order derivatives and learns
consistent update functions even when HOLA fails to converge. However, we also
prove that even consistent update functions do not preserve SFPs, contradicting
the hypothesis that this shortcoming is caused by LOLA's inconsistency.
Finally, in an empirical evaluation on a set of general-sum games, we find that
COLA finds prosocial solutions and that it converges under a wider range of
learning rates than HOLA and LOLA. We support the latter finding with a
theoretical result for a simple game.
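
As a sketch of what consistency under mutual opponent shaping could look like formally, assume each agent i minimizes a loss L_i(\theta_1, \theta_2) with learning rate \alpha and applies an update function f_i, so that \theta_i \leftarrow \theta_i + f_i(\theta_1, \theta_2). The symbols L_i, \theta_i, f_i and \alpha are notation assumed here, not taken from the abstract. Under that assumption, a pair of update functions is consistent if each one is a gradient step on the agent's own loss, evaluated after the other agent's update:

```latex
\begin{align}
  f_1(\theta_1,\theta_2) &= -\alpha\,\nabla_{\theta_1}
      L_1\bigl(\theta_1,\ \theta_2 + f_2(\theta_1,\theta_2)\bigr), \\
  f_2(\theta_1,\theta_2) &= -\alpha\,\nabla_{\theta_2}
      L_2\bigl(\theta_1 + f_1(\theta_1,\theta_2),\ \theta_2\bigr).
\end{align}
```

In the naive-opponent model that the abstract attributes to the original LOLA, the inner f_2 would instead be a plain gradient step on L_2, which is where the mismatch arises.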

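This work introduces Consistent LOLA (COLA), a variant of LOLA whose learned update functions remain consistent under mutual opponent shaping. In a series of experiments on general-sum games, the authors find that it converges more readily than HOLA and LOLA and finds more prosocial solutions.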

COLA: Consistent Learning with Opponent-Learning Awareness

Multi-agent settings are quickly gathering importance in machine learning.
This includes a plethora of recent work on deep multi-agent reinforcement
learning, but also can be extended to hierarchical RL, generative adversarial
networks and decentralised optimisation. In all these settings the presence of
multiple learning agents renders the training problem non-stationary and often
leads to unstable training or undesired final results. We present Learning with
Opponent-Learning Awareness (LOLA), a method in which each agent shapes the
anticipated learning of the other agents in the environment. The LOLA learning
rule includes a term that accounts for the impact of one agent's policy on the
anticipated parameter update of the other agents. Results show that the
encounter of two LOLA agents leads to the emergence of tit-for-tat and
therefore cooperation in the iterated prisoner's dilemma (IPD), while independent
learning does not. In this domain, LOLA also receives higher payouts compared
to a naive learner, and is robust against exploitation by higher-order
gradient-based methods. Applied to repeated matching pennies, LOLA agents
converge to the Nash equilibrium. In a round robin tournament we show that LOLA
agents successfully shape the learning of a range of multi-agent learning
algorithms from the literature, resulting in the highest average returns on the
IPD. We also show that the LOLA update rule can be efficiently calculated using
an extension of the policy gradient estimator, making the method suitable for
model-free RL. The method thus scales to large parameter and input spaces and
nonlinear function approximators. We apply LOLA to a grid world task with an
embedded social dilemma using recurrent policies and opponent modelling. By
explicitly considering the learning of the other agent, LOLA agents learn to
cooperate out of self-interest. The code is at github.com/alshedivat/lola.
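
The extra term in the LOLA learning rule can be illustrated with a minimal sketch on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-parameter bilinear game V1 = x*y, V2 = -x*y (a zero-sum stand-in loosely resembling matching pennies), the hand-derived gradients, and the names `lola_step`, `run`, `lr` and `eta`. Each agent ascends its own value plus a shaping term, obtained by differentiating its value through the other agent's anticipated naive gradient step of size `eta`.

```python
import math


def lola_step(x, y, lr=0.1, eta=1.0):
    """One simultaneous LOLA-style step on the toy game V1 = x*y, V2 = -x*y.

    Both agents maximize their own value. Gradients are derived by hand
    for this particular game (an illustrative assumption).
    """
    # Naive first-order terms: each agent's gradient of its own value.
    g1_naive = y        # dV1/dx
    g2_naive = -x       # dV2/dy

    # Shaping terms. Agent 1 anticipates the naive opponent step
    #   dy = eta * dV2/dy = -eta * x,
    # and differentiates V1(x, y + dy) ~ V1 + (dV1/dy) * dy through dy only:
    #   (dV1/dy) * d(dy)/dx = x * (-eta).
    shape1 = x * (-eta)
    # Agent 2 anticipates dx = eta * dV1/dx = eta * y, giving
    #   (dV2/dx) * d(dx)/dy = (-y) * eta.
    shape2 = (-y) * eta

    new_x = x + lr * (g1_naive + shape1)
    new_y = y + lr * (g2_naive + shape2)
    return new_x, new_y


def run(steps, eta, lr=0.1, start=(1.0, 1.0)):
    """Iterate the update and return the distance from the equilibrium (0, 0)."""
    x, y = start
    for _ in range(steps):
        x, y = lola_step(x, y, lr=lr, eta=eta)
    return math.hypot(x, y)


if __name__ == "__main__":
    print("naive learners (eta=0):", run(200, eta=0.0))
    print("LOLA-style    (eta=1):", run(200, eta=1.0))
```

On this toy game, setting `eta=0` recovers independent gradient ascent, which spirals away from the equilibrium at the origin, while `eta=1` makes the iterates contract toward it. This only illustrates the mechanism in an exact-gradient setting; it says nothing about the policy-gradient estimator or the IPD experiments described above.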

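LOLA is a multi-agent learning method in which each agent explicitly accounts for the learning of the other agents, with the aim of recognizing and exploiting opportunities for cooperation.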