The main challenge of multiagent reinforcement learning is the difficulty of
learning useful policies in the presence of other simultaneously learning
agents whose changing behaviors jointly affect the environment's transition and
reward dynamics. An effective approach that has recently emerged for addressing
this non-stationarity is for each agent to anticipate the learning of other
agents and influence the evolution of future policies towards desirable
behavior for its own benefit. Unfortunately, previous approaches for achieving
this suffer from myopic evaluation, considering only a finite number of policy
updates. As such, these methods can only influence transient future policies
rather than achieving the promise of scalable equilibrium selection approaches
that influence the behavior at convergence. In this paper, we propose a
principled framework for considering the limiting policies of other agents as
time approaches infinity. Specifically, we develop a new optimization objective
that maximizes each agent's average reward by directly accounting for the
impact of its behavior on the limiting set of policies that other agents will
converge to. Our paper characterizes desirable solution concepts within this
problem setting and provides practical approaches for optimizing over possible
outcomes. As a result of our farsighted objective, we demonstrate better
long-term performance than state-of-the-art baselines across a suite of diverse
multiagent benchmark domains.

本文提出了一个基于 farsighted objective 的新优化目标以及一种新的多智能体强化学习方法，实现了优于现有基线结果的长期性能。