Game-playing agents like AlphaGo have achieved superhuman performance through
self-play, which is theoretically guaranteed to yield optimal policies in
competitive games. However, most language tasks are partially or fully
cooperative, so it is an open question whether techniques like self-play can
effectively be used to improve language models. We empirically investigate this
question in a negotiation game setting known as Deal or No Deal (DoND).
Crucially, the objective in DoND can be modified to produce a fully cooperative
game, a strictly competitive one, or anything in between. We finetune language
models in self-play over multiple rounds of filtered behavior cloning in DoND
for each of these objectives. Contrary to expectations, we find that language
model self-play leads to significant performance gains in both cooperation and
competition with humans, suggesting that self-play and related techniques have
promise despite a lack of theoretical guarantees.
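
As a rough illustration of the setup described above, the sketch below shows one way an adjustable objective and a filtering step for behavior cloning could fit together. Everything here is an assumption made for illustration: the `Game` record, the weight `lam` that interpolates between cooperation and competition, and the batch-mean cutoff are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Game:
    """One finished self-play dialogue (hypothetical record layout)."""
    transcript: str
    score_a: float  # points earned by player A
    score_b: float  # points earned by player B


def objective(own: float, other: float, lam: float) -> float:
    """Assumed objective family, not the paper's stated formula.

    lam = 1  -> fully cooperative (both players maximise the shared total)
    lam = 0  -> each player only cares about its own points
    lam = -1 -> strictly competitive (the two objectives sum to zero)
    """
    return own + lam * other


def filter_games(games: list[Game], lam: float) -> list[Game]:
    """Keep games whose objective for player A beats the batch mean.

    The mean cutoff is an illustrative assumption; any filtering rule that
    keeps only high-scoring dialogues would play the same role.
    """
    if not games:
        return []
    values = [objective(g.score_a, g.score_b, lam) for g in games]
    cutoff = mean(values)
    return [g for g, v in zip(games, values) if v > cutoff]
```

In a full training loop, the transcripts returned by `filter_games` would serve as the finetuning data for the next round of self-play; that step is omitted here because it depends on the model and training stack.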

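Through self-play on the Deal or No Deal negotiation game, we find that language model self-play yields significant performance gains in both cooperation and competition, suggesting the promise of self-play and related techniques.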

Efficacy of Language Model Self-Play in Non-Zero-Sum Games

In many board games and other abstract games, patterns have been used as
features that can guide automated game-playing agents. Such patterns or
features often represent particular configurations of pieces, empty positions,
etc., which may be relevant for a game's strategies. Their use has been
particularly prevalent in the game of Go, but also in many other games used as
benchmarks for AI research. Simple linear policies over such features are
unlikely to match the state-of-the-art playing strength of the deep neural
networks that have become more common in recent years. However, they
typically require significantly fewer resources to train, which is paramount
for large-scale studies of hundreds to thousands of distinct games. In this
paper, we formulate a design and efficient implementation of spatial
state-action features for general games. These are patterns that can be trained
to incentivise or disincentivise actions based on whether or not they match
variables of the state in a local area around action variables. We provide
extensive details on several design and implementation choices, with a primary
focus on achieving a high degree of generality to support a wide variety of
different games using different board geometries or other graphs. We also
propose an efficient approach for evaluating active features for any given set
of features. In this approach, we take inspiration from heuristics used in
problems such as SAT to optimise the order in which parts of patterns are
matched and prune unnecessary evaluations. An empirical evaluation on 33
distinct games in the Ludii general game system demonstrates the efficiency of
this approach in comparison to a naive baseline, as well as a baseline based on
prefix trees.
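
To make the notion of a spatial state-action feature concrete, the following minimal sketch checks patterns around an action's target cell on a square grid and feeds the active features into a linear softmax policy. It is a naive matcher in the spirit of a simple baseline, not the optimised evaluation approach the abstract proposes; the grid encoding, the `OFF_BOARD` sentinel, and all function names are hypothetical.

```python
import math

# A feature is a list of (d_row, d_col, required_content) conditions checked
# relative to the cell an action targets. This layout is an assumption.
Feature = list[tuple[int, int, str]]

OFF_BOARD = "#"  # sentinel for positions outside the grid


def cell(board: list[str], row: int, col: int) -> str:
    """Return the content of a cell, or OFF_BOARD if it lies outside the grid."""
    if 0 <= row < len(board) and 0 <= col < len(board[row]):
        return board[row][col]
    return OFF_BOARD


def is_active(feature: Feature, board: list[str], row: int, col: int) -> bool:
    """A feature is active when every condition matches near (row, col).

    all() stops at the first mismatch, so later conditions are skipped.
    """
    return all(cell(board, row + dr, col + dc) == want for dr, dc, want in feature)


def action_logit(features, weights, board, row, col) -> float:
    """Linear score: the sum of the weights of all active features."""
    return sum(w for f, w in zip(features, weights) if is_active(f, board, row, col))


def policy(features, weights, board, actions) -> list[float]:
    """Softmax over the linear scores of the candidate actions (row, col)."""
    logits = [action_logit(features, weights, board, r, c) for r, c in actions]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A positive weight makes the policy favour actions where the pattern matches, and a negative weight discourages them. Because every feature is tested independently for every action, the cost grows with the size of the feature set, which is the kind of overhead that ordering and pruning of pattern checks aim to reduce.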

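This paper proposes an efficient design and implementation of spatial state-action features for general games, with design and implementation choices that apply across a wide variety of games. The features can be trained to incentivise or disincentivise actions according to whether they match state variables in a local area.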