Many reinforcement learning (RL) applications have combinatorial action
spaces, where each action is a composition of sub-actions. A standard RL
approach ignores this inherent factorization structure, resulting in a
potential failure to make meaningful inferences about rarely observed
sub-action combinations; this is particularly problematic for offline settings,
where data may be limited. In this work, we propose a form of linear Q-function
decomposition induced by factored action spaces. We study the theoretical
properties of our approach, identifying scenarios where it is guaranteed to
lead to zero bias when used to approximate the Q-function. Outside the regimes
with theoretical guarantees, we show that our approach can still be useful
because it leads to better sample efficiency without necessarily sacrificing
policy optimality, allowing us to achieve a better bias-variance trade-off.
Across several offline RL problems using simulators and real-world datasets
motivated by healthcare, we demonstrate that incorporating factored action
spaces into value-based RL can result in better-performing policies. Our
approach can help an agent make more accurate inferences within underexplored
regions of the state-action space when applying RL to observational datasets.
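
As a rough illustration of why a linear decomposition over sub-actions can generalize to unseen combinations, the sketch below fits Q(a) = q_1(a_1) + q_2(a_2) for a single state by gradient descent. The helper names (`featurize`, `fit`, `q_value`), the toy targets, and the single-state setup are assumptions made for illustration; this is not the authors' implementation, and it assumes the true values happen to be additive.

```python
def featurize(action, sizes):
    """Concatenate one-hot encodings of each sub-action (hypothetical helper)."""
    feats = []
    for a_d, n_d in zip(action, sizes):
        feats.extend(1.0 if i == a_d else 0.0 for i in range(n_d))
    return feats


def q_value(weights, action, sizes):
    """Linear Q-value: the sum of one weight per chosen sub-action."""
    return sum(w * x for w, x in zip(weights, featurize(action, sizes)))


def fit(data, sizes, lr=0.1, steps=2000):
    """Full-batch gradient descent on the squared error (illustrative only)."""
    weights = [0.0] * sum(sizes)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for action, target in data:
            x = featurize(action, sizes)
            err = sum(w * xi for w, xi in zip(weights, x)) - target
            for i, xi in enumerate(x):
                grad[i] += err * xi
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Two binary sub-actions; the combination (1, 1) never appears in the data.
sizes = (2, 2)
data = [((0, 0), 0.0), ((1, 0), 1.0), ((0, 1), 2.0)]  # assumed toy targets
weights = fit(data, sizes)
unseen = q_value(weights, (1, 1), sizes)
print(round(unseen, 3))
```

Because any additive fit that matches the three observed combinations must assign Q(1,1) = Q(1,0) + Q(0,1) - Q(0,0), the estimate for the unseen combination is pinned down by the data. A table with one independent entry per joint action would have no information about it.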

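This paper studies how a linear Q-function decomposition over combinatorial action spaces in reinforcement learning can better handle rarely observed sub-action combinations. It analyzes the approach theoretically and evaluates it experimentally, showing that it can improve sample efficiency and the performance of the learned policies.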

Leveraging Factored Action Spaces for Efficient Offline Reinforcement Learning in Healthcare

A practical challenge in reinforcement learning is combinatorial action
spaces that make planning computationally demanding. For example, in
cooperative multi-agent reinforcement learning, a potentially large number of
agents jointly optimize a global reward function, which leads to a
combinatorial blow-up in the action space by the number of agents. As a minimal
requirement, we assume access to an argmax oracle that allows us to efficiently
compute the greedy policy for any Q-function in the model class. Building on
recent work in planning with local access to a simulator and linear function
approximation, we propose efficient algorithms for this setting that lead to
polynomial compute and query complexity in all relevant problem parameters. For
the special case where the feature decomposition is additive, we further
improve the bounds and extend the results to the kernelized setting with an
efficient algorithm.
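
To make the role of an additive structure concrete, the sketch below assumes the joint Q-value is a plain sum of per-agent terms with no interaction between agents. Under that assumption (which is an illustrative reading of "additive", not the paper's exact model class), the greedy joint action can be found one agent at a time rather than by enumerating every joint action. The function names and the numbers in `local_q` are hypothetical.

```python
from itertools import product


def greedy_joint_action(local_q):
    """Per-agent argmax; local_q[i][a] is agent i's additive term for action a.

    Cost grows with (agents x local actions), not with the joint action count.
    """
    return tuple(max(range(len(q_i)), key=q_i.__getitem__) for q_i in local_q)


def brute_force_joint_action(local_q):
    """Reference argmax that enumerates the full combinatorial action space."""
    joint_actions = product(*(range(len(q_i)) for q_i in local_q))
    return max(
        joint_actions,
        key=lambda a: sum(q_i[a_i] for q_i, a_i in zip(local_q, a)),
    )


# Three agents with three local actions each: 27 joint actions in total.
local_q = [
    [0.1, 0.7, 0.2],
    [0.5, 0.4, 0.9],
    [0.3, 0.0, 0.1],
]
fast = greedy_joint_action(local_q)
slow = brute_force_joint_action(local_q)
print(fast, slow)
```

Both functions return the same joint action here, but the brute-force version visits all 27 combinations while the per-agent version inspects only 9 values. When the decomposition is not additive, this shortcut no longer applies and a more general argmax oracle is needed.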

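This paper studies planning in cooperative multi-agent reinforcement learning with combinatorial action spaces. Assuming only access to an argmax oracle, and building on planning with local access to a simulator and linear function approximation, it proposes efficient algorithms with polynomial compute and query complexity in all relevant problem parameters.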

Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning

Prior AI successes in complex games have largely focused on settings with at
most hundreds of actions at each decision point. In contrast, Diplomacy is a
game with more than 10^20 possible actions per turn. Previous attempts to
address games with large branching factors, such as Diplomacy, StarCraft, and
Dota, used human data to bootstrap the policy or used handcrafted reward
shaping. In this paper, we describe an algorithm for action exploration and
equilibrium approximation in games with combinatorial action spaces. This
algorithm performs value iteration while simultaneously learning a policy
proposal network. A double oracle step is used to explore additional actions to
add to the policy proposals. At each state, the target state value and policy
for the model training are computed via an equilibrium search procedure. Using
this algorithm, we train an agent, DORA, completely from scratch for a popular
two-player variant of Diplomacy and show that it achieves superhuman
performance. Additionally, we extend our methods to full-scale no-press
Diplomacy and for the first time train an agent from scratch with no human
data. We present evidence that this agent plays a strategy that is incompatible
with human-data bootstrapped agents. This presents the first strong evidence of
multiple equilibria in Diplomacy and suggests that self-play alone may be
insufficient for achieving superhuman performance in Diplomacy.
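
The double oracle idea mentioned above can be illustrated on a small zero-sum matrix game. The sketch below is a toy version only: it uses rock-paper-scissors as the payoff matrix and fictitious play as an assumed approximate solver for the restricted game. It is not DORA's procedure, which combines the step with value iteration, a learned policy proposal network, and equilibrium search; all function names here are hypothetical.

```python
def fictitious_play(payoff, rows, cols, iters=2000):
    """Approximate an equilibrium of the game restricted to rows x cols."""
    row_counts = {r: 0 for r in rows}
    col_counts = {c: 0 for c in cols}
    row_counts[rows[0]] = 1
    col_counts[cols[0]] = 1
    for _ in range(iters):
        # Each side best-responds to the opponent's empirical play so far.
        best_r = max(rows, key=lambda r: sum(payoff[r][c] * n for c, n in col_counts.items()))
        best_c = min(cols, key=lambda c: sum(payoff[r][c] * n for r, n in row_counts.items()))
        row_counts[best_r] += 1
        col_counts[best_c] += 1
    total_r = sum(row_counts.values())
    total_c = sum(col_counts.values())
    p = {r: n / total_r for r, n in row_counts.items()}
    q = {c: n / total_c for c, n in col_counts.items()}
    return p, q


def double_oracle(payoff, max_rounds=20):
    """Grow the restricted action sets with best responses from the full game."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    rows, cols = [0], [0]  # start from a single candidate action per player
    for _ in range(max_rounds):
        p, q = fictitious_play(payoff, rows, cols)
        # Oracle step: best responses over ALL actions, not just the restricted sets.
        br_row = max(range(n_rows), key=lambda r: sum(payoff[r][c] * w for c, w in q.items()))
        br_col = min(range(n_cols), key=lambda c: sum(payoff[r][c] * w for r, w in p.items()))
        if br_row in rows and br_col in cols:
            break  # no new action improves on the restricted equilibrium
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return rows, cols, p, q


# Row player's payoffs in rock-paper-scissors (rows and columns: R, P, S).
rps = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
rows, cols, p, q = double_oracle(rps)
print(sorted(rows), sorted(cols))
```

Starting from rock alone, each round adds the action that beats the current restricted equilibrium, so the candidate sets expand until all three actions are included. In a game with a very large action space, the appeal of this scheme is that only actions that are best responses at some point ever need to be considered.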

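This paper presents an algorithm for action exploration and equilibrium approximation in games with combinatorial action spaces, which performs value iteration while learning a policy proposal network. The algorithm is used to train an agent, DORA, completely from scratch, and it achieves superhuman performance in a two-player variant of Diplomacy. In full-scale no-press Diplomacy, the agent trained from scratch plays a strategy that is incompatible with human-data bootstrapped agents. This is the first strong evidence of multiple equilibria in Diplomacy and suggests that self-play alone may be insufficient for reaching superhuman performance.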

No-Press Diplomacy from Scratch

A hallmark of human intelligence is the ability to understand and communicate
with language. Interactive Fiction games are fully text-based simulation
environments where a player issues text commands to effect change in the
environment and progress through the story. We argue that IF games are an
excellent testbed for studying language-based autonomous agents. In particular,
IF games combine challenges of combinatorial action spaces, language
understanding, and commonsense reasoning. To facilitate rapid development of
language-based agents, we introduce Jericho, a learning environment for
man-made IF games, and conduct a comprehensive study of text-agents across a
rich set of games, highlighting directions in which agents can improve.

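This paper introduces Jericho, a learning environment for IF games, and conducts a comprehensive study of text agents across a rich set of games, highlighting directions in which agents can improve.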