Learning to cooperate is crucially important in multi-agent environments. The
key is to understand the mutual interplay between agents. However, multi-agent
environments are highly dynamic: agents keep moving and their neighbors
change quickly, which makes it hard to learn abstract representations of the
mutual interplay between agents. To tackle these difficulties, we propose graph
convolutional reinforcement learning, where graph convolution adapts to the
dynamics of the underlying graph of the multi-agent environment, and relation
kernels capture the interplay between agents by their relation representations.
Latent features produced by convolutional layers with gradually increasing
receptive fields are exploited to learn cooperation, and cooperation is further
improved by temporal relation regularization for consistency. Empirically, we
show that our method substantially outperforms existing methods in a variety of
cooperative scenarios.

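Summary: This paper proposes graph convolutional reinforcement learning. Graph convolution adapts to the dynamics of the multi-agent environment, relation kernels capture the interplay between agents, and latent features produced by convolutional layers with gradually increasing receptive fields are used to learn cooperation. Temporal relation regularization is added for consistency. Experiments show that the method substantially outperforms existing methods in a variety of cooperative scenarios.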

Graph Convolutional Reinforcement Learning
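
The abstract above describes convolution over a changing agent graph, with relation kernels weighting each neighbor. A minimal sketch of that idea follows, assuming the relation kernel is a single-head scaled dot-product attention without learned projections; the function names, the chain-shaped toy graph, and the KL-based consistency measure are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relation_conv(features, neighbors):
    """One convolution step: each agent attends over itself and its neighbors.

    features:  list of per-agent feature vectors (lists of floats)
    neighbors: neighbors[i] lists the agents adjacent to agent i at this
               timestep; it is rebuilt whenever agents move (assumption)
    Returns (new_features, attention_weights).
    """
    new_features, weights = [], []
    for i, h_i in enumerate(features):
        group = [i] + list(neighbors[i])
        scale = math.sqrt(len(h_i))
        # Relation kernel (assumed): scaled dot-product attention scores.
        alpha = softmax([dot(h_i, features[j]) / scale for j in group])
        # Aggregate neighbor features weighted by their relation scores.
        agg = [sum(a * features[j][k] for a, j in zip(alpha, group))
               for k in range(len(h_i))]
        new_features.append(agg)
        weights.append(dict(zip(group, alpha)))
    return new_features, weights

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two attention distributions keyed by agent index.

    Used here as a stand-in for a temporal consistency penalty between the
    attention at consecutive timesteps (assumption of this sketch).
    """
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0)

# Toy chain graph 0 - 1 - 2 with one-hot features.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
neighbors = [[1], [0, 2], [1]]
layer1, attn1 = relation_conv(features, neighbors)
layer2, attn2 = relation_conv(layer1, neighbors)
```

Stacking the step twice is what widens the receptive field in this sketch: after one layer agent 0 only sees agents 0 and 1, while after two layers its feature also carries information from agent 2.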

Reinforcement learning (RL) problems are often phrased in terms of Markov
decision processes (MDPs). In this thesis we go beyond MDPs and consider RL in
environments that are non-Markovian, non-ergodic and only partially observable.
Our focus is not on practical algorithms, but rather on the fundamental
underlying problems: How do we balance exploration and exploitation? How do we
explore optimally? When is an agent optimal? We follow the nonparametric
realizable paradigm.
We establish negative results on Bayesian RL agents, in particular AIXI. We
show that unlucky or adversarial choices of the prior cause the agent to
misbehave drastically. Therefore Legg-Hutter intelligence and balanced Pareto
optimality, which depend crucially on the choice of the prior, are entirely
subjective. Moreover, in the class of all computable environments every policy
is Pareto optimal. This undermines all existing optimality properties for AIXI.
However, there are Bayesian approaches to general RL that satisfy objective
optimality guarantees: We prove that Thompson sampling is asymptotically
optimal in stochastic environments in the sense that its value converges to the
value of the optimal policy. We connect asymptotic optimality to regret given a
recoverability assumption on the environment that allows the agent to recover
from mistakes. Hence Thompson sampling achieves sublinear regret in these
environments.
Our results culminate in a formal solution to the grain of truth problem: A
Bayesian agent acting in a multi-agent environment learns to predict the other
agents' policies if its prior assigns positive probability to them (the prior
contains a grain of truth). We construct a large but limit computable class
containing a grain of truth and show that agents based on Thompson sampling
over this class converge to play Nash equilibria in arbitrary unknown
computable multi-agent environments.

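Summary: This thesis studies reinforcement learning in environments that are non-Markovian, non-ergodic and only partially observable. The author establishes negative results for Bayesian reinforcement learning agents and proves that Thompson sampling is asymptotically optimal in stochastic environments. The author also constructs a large but limit computable class containing a grain of truth and shows that agents based on Thompson sampling over this class converge to play Nash equilibria in arbitrary unknown computable multi-agent environments.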

Nonparametric General Reinforcement Learning
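
The thesis analyses Thompson sampling in general, history-based environments. As a much simpler illustration of the same principle (sample a hypothesis from the posterior, then act optimally for that sample), here is a sketch for a Bernoulli bandit with Beta(1, 1) priors. The function name, the arm means and the step count are assumptions chosen for the example and do not come from the thesis.

```python
import random

def thompson_bernoulli(true_means, steps, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    true_means: hidden success probability of each arm (toy assumption)
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n = len(true_means)
    successes = [0] * n
    failures = [0] * n
    pulls = [0] * n
    for _ in range(steps):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n)]
        # Act optimally with respect to the sampled hypothesis.
        arm = max(range(n), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Bayesian update of the pulled arm's posterior counts.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.8], steps=2000)
```

In this toy setting every mistake is recoverable, since pulling a bad arm never removes access to the good one, which loosely mirrors the recoverability assumption the abstract ties to sublinear regret.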

Recent advances in Bayesian reinforcement learning (BRL) have shown that
Bayes-optimality is theoretically achievable by modeling the environment's
latent dynamics using a Flat-Dirichlet-Multinomial (FDM) prior. In
self-interested multi-agent environments, the transition dynamics are mainly
controlled by the other agent's stochastic behavior, for which FDM's
independence and modeling assumptions do not hold. As a result, FDM does not
allow the other agent's behavior to be generalized across different states or
specified using prior domain knowledge. To overcome these practical limitations
of FDM, we propose a generalization of BRL to integrate the general class of
parametric models and model priors, thus allowing practitioners' domain
knowledge to be exploited to produce a fine-grained and compact representation
of the other agent's behavior. Empirical evaluation shows that our approach
outperforms existing multi-agent reinforcement learning algorithms.

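Summary: This paper proposes a generalization of Bayesian reinforcement learning that integrates the general class of parametric models and model priors, giving a fine-grained and compact representation of the other agent's behavior in self-interested multi-agent environments. Empirically, it outperforms existing multi-agent reinforcement learning algorithms.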
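
To make the limitation described in the abstract concrete, the sketch below implements a flat Dirichlet-multinomial transition model in the usual sense: one independent symmetric Dirichlet per (state, action) pair, updated by counting. The class name, the concentration parameter `alpha` and the toy observations are assumptions for illustration; the paper's own parametric generalization is not reproduced here.

```python
from collections import defaultdict

class FlatDirichletMultinomial:
    """Independent Dirichlet-multinomial model per (state, action) pair."""

    def __init__(self, num_states, alpha=1.0):
        self.num_states = num_states
        self.alpha = alpha  # symmetric Dirichlet concentration (assumption)
        self.counts = defaultdict(lambda: [0] * num_states)

    def observe(self, state, action, next_state):
        # The posterior update only touches the counts of this exact pair.
        self.counts[(state, action)][next_state] += 1

    def predictive(self, state, action):
        # Posterior predictive: (count + alpha) / (total + alpha * num_states).
        c = self.counts[(state, action)]
        total = sum(c) + self.alpha * self.num_states
        return [(n + self.alpha) / total for n in c]

model = FlatDirichletMultinomial(num_states=3)
for _ in range(4):
    model.observe(state=0, action="a", next_state=1)

seen = model.predictive(0, "a")    # sharpened by the four observations
unseen = model.predictive(1, "a")  # still the uniform prior
```

Observations at state 0 leave the prediction at state 1 untouched, which is the lack of generalization across states that the abstract attributes to FDM.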