We study Markov decision processes (MDPs) in which agents have direct control
over when and how they gather information, as formalized by action-contingent
noiselessly observable MDPs (ACNO-MDPs). In these models, actions consist of
two components: a control action that affects the environment, and a
measurement action that affects what the agent can observe. To solve ACNO-MDPs,
we introduce the act-then-measure (ATM) heuristic, which assumes that we can
ignore future state uncertainty when choosing control actions. We show how
following this heuristic may lead to shorter policy computation times and prove
a bound on the performance loss incurred by the heuristic. To decide whether or
not to take a measurement action, we introduce the concept of measuring value.
We develop a reinforcement learning algorithm based on the ATM heuristic, using
a Dyna-Q variant adapted for partially observable domains, and showcase its
superior performance compared to prior methods on a number of
partially-observable environments.
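
The decision step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define measuring value, so the sketch assumes it is the expected gain from knowing the next state before acting, minus a measurement cost. The names `q_values`, `transition`, and `measure_cost` are hypothetical, as is the tabular belief representation.

```python
# Hypothetical sketch of an act-then-measure style decision step.
# Assumed inputs: q_values[state][action] are fully observable Q-values,
# transition[state][action] maps next states to probabilities.

def expected_q(belief, q_values, action):
    """Expected Q-value of `action` under a belief {state: probability}."""
    return sum(p * q_values[s][action] for s, p in belief.items())

def choose_control_action(belief, q_values, actions):
    """Pick the control action while ignoring future state uncertainty."""
    return max(actions, key=lambda a: expected_q(belief, q_values, a))

def predict_next_belief(belief, transition, action):
    """Push the belief through the transition model P(s' | s, a)."""
    next_belief = {}
    for s, p in belief.items():
        for s2, p2 in transition[s][action].items():
            next_belief[s2] = next_belief.get(s2, 0.0) + p * p2
    return next_belief

def measuring_value(next_belief, q_values, actions, measure_cost):
    """Assumed definition: value of acting with the next state known,
    minus value of acting on the belief alone, minus the measurement cost."""
    informed = sum(p * max(q_values[s][a] for a in actions)
                   for s, p in next_belief.items())
    blind = max(expected_q(next_belief, q_values, a) for a in actions)
    return informed - blind - measure_cost

def atm_step(belief, q_values, transition, actions, measure_cost):
    """Choose the control action first, then decide whether to measure."""
    control = choose_control_action(belief, q_values, actions)
    next_belief = predict_next_belief(belief, transition, control)
    measure = measuring_value(next_belief, q_values, actions, measure_cost) > 0
    return control, measure

# Toy problem: two states, and the best action depends on the state.
actions = ["left", "right"]
q_values = {"s0": {"left": 1.0, "right": 0.0},
            "s1": {"left": 0.0, "right": 1.0}}
uniform = {"s0": 0.5, "s1": 0.5}
transition = {s: {a: dict(uniform) for a in actions} for s in q_values}

cheap = atm_step({"s0": 1.0}, q_values, transition, actions, measure_cost=0.2)
costly = atm_step({"s0": 1.0}, q_values, transition, actions, measure_cost=0.6)
```

In this toy problem, knowing the next state is worth 0.5 before costs, so the sketch measures when the cost is 0.2 and skips the measurement when the cost is 0.6.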

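This paper studies action-contingent noiselessly observable MDPs (ACNO-MDPs), a class of Markov decision processes, proposes a reinforcement learning algorithm based on the act-then-measure heuristic, and demonstrates its strong performance in partially observable environments.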

Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring

Recent developments in the field of model-based RL have proven successful in
a range of environments, especially ones where planning is essential. However,
such successes have been limited to deterministic fully-observed environments.
We present a new approach that handles stochastic and partially-observable
environments. Our key insight is to use discrete autoencoders to capture the
multiple possible effects of an action in a stochastic environment. We use a
stochastic variant of Monte Carlo tree search to plan over both the agent's
actions and the discrete latent variables representing the environment's
response. Our approach significantly outperforms an offline version of MuZero
on a stochastic interpretation of chess where the opponent is considered part
of the environment. We also show that our approach scales to DeepMind Lab, a
first-person 3D environment with large visual observations and partial
observability.
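
To illustrate the discrete latent variables mentioned above, here is a minimal sketch of the vector quantization step. It assumes a nearest-neighbour lookup in a fixed codebook under squared Euclidean distance. The codebook values and function name are hypothetical, and no training or planning is shown.

```python
# Hypothetical sketch: map a continuous encoding to a discrete latent
# by choosing the nearest codebook entry (squared Euclidean distance).

def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`."""
    def squared_distance(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]

# Illustrative codebook with three 2-D codes.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
```

Under this assumption, the returned index is the discrete variable a planner could branch over, with one branch per possible environment response.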

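Discrete autoencoders capture the multiple possible outcomes of an action in a stochastic environment, and a stochastic variant of Monte Carlo tree search plans over both the agent's actions and the discrete latent variables representing the environment's response. The approach significantly outperforms an offline version of MuZero on a stochastic interpretation of chess and scales to DeepMind Lab, a partially observable 3D environment.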

Vector Quantized Models for Planning

We propose a targeted communication architecture for multi-agent
reinforcement learning, where agents learn both what messages to send and whom
to address them to while performing cooperative tasks in partially-observable
environments. This targeting behavior is learnt solely from downstream
task-specific reward without any communication supervision. We additionally
augment this with a multi-round communication approach where agents coordinate
via multiple rounds of communication before taking actions in the environment.
We evaluate our approach on a diverse set of cooperative multi-agent tasks of
varying difficulty, with varying numbers of agents, in a variety of
environments ranging from 2D grid layouts of shapes and simulated traffic
junctions to 3D indoor environments, and demonstrate the benefits of targeted
and multi-round communication. Moreover, we show that the targeted
communication strategies learned by agents are interpretable and intuitive.
Finally, we show that our architecture can be easily extended to mixed and
competitive environments, leading to improved performance and sample complexity
over recent state-of-the-art approaches.
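
One common way to realize soft targeting of messages is dot-product attention between a receiver's query and each sender's signature. The abstract does not state the mechanism, so the sketch below is an assumption rather than the authors' architecture. The names `query`, `signatures`, and `values` are hypothetical.

```python
import math

# Hypothetical sketch: a receiver weights incoming messages by how well
# each sender's signature matches the receiver's query.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def receive(query, signatures, values):
    """Return attention weights and the weighted sum of message values."""
    weights = softmax([dot(query, sig) for sig in signatures])
    dim = len(values[0])
    aggregated = [sum(w * v[d] for w, v in zip(weights, values))
                  for d in range(dim)]
    return weights, aggregated

# Two senders: the first signature matches the receiver's query.
query = [1.0, 0.0]
signatures = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, aggregated = receive(query, signatures, values)
```

In this toy case almost all of the weight goes to the first sender, so the receiver effectively hears only the message addressed to it.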

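This paper proposes a targeted communication architecture for multi-agent reinforcement learning, in which agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially observable environments. The targeting behavior is learned solely from downstream task-specific reward, without any communication supervision. A multi-round communication approach additionally lets agents coordinate over several rounds of communication before acting in the environment. Experiments across a variety of environments and tasks demonstrate the benefits of targeted and multi-round communication, and the learned targeting strategies are interpretable and intuitive. Finally, the architecture extends easily to mixed and competitive environments, improving performance and sample complexity.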