In this work, our goal is to train agents that can coordinate with seen,
unseen, and human partners in a multi-agent communication environment
involving natural language. Previous work using a single set of agents has
shown great progress in generalizing to known partners; however, it struggles
when coordinating with unfamiliar agents. To mitigate this, recent work
explored the use of population-based approaches, where multiple agents interact
with each other with the goal of learning more generic protocols. These
methods can yield good coordination between unseen partners, but only for
simple languages, and thus fail to adapt to human partners using natural
language. We attribute this to the use of static
populations and instead propose a dynamic population-based meta-learning
approach that builds such a population in an iterative manner. We perform a
holistic evaluation of our method on two different referential games, and show
that our agents outperform all prior work when communicating with seen partners
and humans. Furthermore, we analyze the natural language generation skills of
our agents, where we find that our agents also outperform strong baselines.
Finally, we test the robustness of our agents when communicating with
out-of-population agents and carefully test the importance of each component of
our method through ablation studies.

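We use a dynamic population-based meta-learning approach to train agents to coordinate with seen, unseen, and human partners in a multi-agent communication environment involving natural language. A holistic evaluation on two different referential games shows that our method outperforms all prior work when cooperating with humans.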

Dynamic population-based meta-learning with natural language, for multi-agent communication

Dynamic population-based meta-learning for multi-agent communication with natural language

We study the asymptotic optimal control of multi-class restless bandits. A
restless bandit is a controllable stochastic process whose state evolution
depends on whether or not the bandit is made active. Since finding the optimal
control is typically intractable, we propose a class of priority policies that
are proved to be asymptotically optimal under a global attractor property and a
technical condition. We consider both a fixed population of bandits and
a dynamic population where bandits can depart and arrive. As an example of a
dynamic population of bandits, we analyze a multi-class $\mathit{M/M/S+M}$
queue for which we show asymptotic optimality of an index policy. We combine
fluid-scaling techniques with linear programming results to prove that when
bandits are indexable, Whittle's index policy is included in our class of
priority policies. We thereby generalize a result of Weber and Weiss [J. Appl.
Probab. 27 (1990) 637-648] about asymptotic optimality of Whittle's index
policy to settings with (i) several classes of bandits, (ii) arrivals of new
bandits and (iii) multiple actions. Indexability of the bandits is not required
for our results to hold. For nonindexable bandits, we describe how to select
priority policies from the class of asymptotically optimal policies and present
numerical evidence that, outside the asymptotic regime, the performance of our
proposed priority policies is nearly optimal.
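
To make the notions of a restless bandit and a priority policy concrete, the following is a minimal sketch under stated assumptions, not the construction analyzed in the abstract. The two-state bandits, the transition probabilities in `P_NEXT_ONE`, the ordering in `PRIORITY`, and the helper names `choose_active` and `step` are all hypothetical and chosen only for illustration. The sketch shows the two defining features: every bandit evolves at each step whether or not it is active, and the policy activates the bandits whose current states rank highest in a fixed priority order.

```python
import random

# Probability that the next state is 1, indexed as P_NEXT_ONE[action][state].
# Action 0 = passive, action 1 = active. Illustrative values only (assumption).
P_NEXT_ONE = {
    0: [0.1, 0.6],
    1: [0.5, 0.9],
}

# Hypothetical priority order over states: bandits whose state appears
# earlier in this list are activated first.
PRIORITY = [0, 1]


def choose_active(states, budget, priority=PRIORITY):
    """Return the indices of the bandits activated by the priority policy."""
    rank = {state: r for r, state in enumerate(priority)}
    # sorted() is stable, so ties within a state are broken by bandit index.
    order = sorted(range(len(states)), key=lambda i: rank[states[i]])
    return set(order[:budget])


def step(states, budget, rng):
    """Advance all bandits by one time step.

    Passive bandits also change state ("restless"), but with the passive
    transition probabilities.
    """
    active = choose_active(states, budget)
    new_states = []
    for i, s in enumerate(states):
        action = 1 if i in active else 0
        new_states.append(1 if rng.random() < P_NEXT_ONE[action][s] else 0)
    return new_states, active


# Usage: simulate ten bandits with a budget of three activations per step.
rng = random.Random(0)
states = [rng.randint(0, 1) for _ in range(10)]
for t in range(5):
    states, active = step(states, budget=3, rng=rng)
    print(t, states, sorted(active))
```

A priority policy of this form needs only a ranking of states, which is why it remains well defined even when the bandits are not indexable.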

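This paper studies the asymptotically optimal control of multi-class restless bandits and proposes a class of priority policies that are proved to be asymptotically optimal under a global attractor property and a technical condition. Combining fluid-scaling techniques with linear programming results, we prove that when bandits are indexable, Whittle's index policy is included in our class of priority policies. We also describe how to select priority policies from the class of asymptotically optimal policies.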

Asymptotically optimal priority policies for indexable and nonindexable restless bandits

Asymptotically optimal priority policies for indexable and nonindexable restless bandits

We study the quality of outcomes in repeated games when the population of
players is dynamically changing and participants use learning algorithms to
adapt to the changing environment. Game theory classically considers Nash
equilibria of one-shot games, while in practice many games are played
repeatedly, and in such games players often use algorithmic tools to learn to
play in the given environment. Most previous work on learning in repeated games
assumes that the population playing the game is static over time.
We analyze the efficiency of repeated games in dynamically changing
environments, motivated by application domains such as Internet ad-auctions and
packet routing. We prove that, in many classes of games, if players choose
their strategies in a way that guarantees low adaptive regret, then high social
welfare is ensured, even under very frequent changes. In fact, in large markets
learning players achieve asymptotically optimal social welfare despite high
turnover. Previous work has only shown that high welfare is guaranteed for
learning outcomes in static environments. Our work extends these results to
more realistic settings in which participation is drastically evolving over time.
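
For reference, one common way to formalize "low adaptive regret" is sketched here. The notation is an assumption introduced for illustration and does not come from the abstract: $a_i^t$ is the action of player $i$ at time $t$, $A_i$ is that player's action set, and $c_i^t(a)$ is the cost of action $a$ at time $t$ given the other players' choices. The regret of player $i$ over an interval $[\tau_1, \tau_2]$ is

$$
R_i(\tau_1, \tau_2) = \sum_{t=\tau_1}^{\tau_2} c_i^t(a_i^t) - \min_{a \in A_i} \sum_{t=\tau_1}^{\tau_2} c_i^t(a).
$$

Under this reading, a player has low adaptive regret when $R_i(\tau_1, \tau_2)$ is small for every interval $[\tau_1, \tau_2]$, not only for the full horizon $[1, T]$. This interval-wise guarantee is what allows the comparison action to change as the population changes.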

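We study the quality of outcomes in repeated games in which a dynamically changing population of players uses learning algorithms to adapt to the changing environment. We prove that, in many classes of games, if players choose their strategies in a way that guarantees low adaptive regret, then high social welfare is ensured even under very frequent changes, which makes the result more realistic than previous work.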