Multi-Agent Reinforcement Learning (MARL) algorithms are widely used to tackle
complex tasks that require collaboration and competition among agents in
dynamic Multi-Agent Systems (MAS). However, learning such tasks from scratch
is arduous and may not always be feasible, particularly for MASs with a large
number of interacting agents, because of the high sample complexity. Reusing
knowledge gained from past experiences or from other agents could therefore
accelerate the learning process and help MARL algorithms scale. In this study,
we introduce a novel framework that enables transfer learning for MARL by
unifying various state spaces into fixed-size inputs, so that a single unified
deep-learning policy can be used across different scenarios within a MAS. We
evaluated our approach in a range of scenarios within the StarCraft Multi-Agent
Challenge (SMAC) environment, and the findings show significant enhancements in
multi-agent learning performance using maneuvering skills learned from other
scenarios compared to agents learning from scratch. Furthermore, we adopted
Curriculum Transfer Learning (CTL), enabling our deep learning policy to
progressively acquire knowledge and skills across pre-designed homogeneous
learning scenarios organized by difficulty levels. This process promotes inter-
and intra-agent knowledge transfer, leading to high multi-agent learning
performance in more complicated heterogeneous scenarios.
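
The abstract does not say how the state spaces are unified, so the following
is only a minimal sketch of one simple possibility: zero-padding per-unit
features up to a fixed number of slots and adding a presence flag. The function
name, the slot limits, and the feature layout are all hypothetical and are not
taken from the paper.

```python
from typing import List, Sequence


def unify_observation(
    ally_feats: Sequence[Sequence[float]],
    enemy_feats: Sequence[Sequence[float]],
    max_allies: int,
    max_enemies: int,
    feat_dim: int,
) -> List[float]:
    """Flatten per-unit features into a vector of scenario-independent length.

    Hypothetical layout: each slot holds `feat_dim` features plus one
    presence flag (1.0 for a real unit, 0.0 for padding).
    """

    def pad(units: Sequence[Sequence[float]], max_units: int) -> List[float]:
        if len(units) > max_units:
            raise ValueError("scenario has more units than the unified layout")
        out: List[float] = []
        for i in range(max_units):
            if i < len(units):
                if len(units[i]) != feat_dim:
                    raise ValueError("unexpected feature dimension")
                out.extend(units[i])
                out.append(1.0)  # presence flag: real unit
            else:
                out.extend([0.0] * feat_dim)
                out.append(0.0)  # presence flag: padded slot
        return out

    return pad(ally_feats, max_allies) + pad(enemy_feats, max_enemies)


# Two scenarios with different unit counts map to the same input size.
small = unify_observation([[1.0] * 4] * 3, [[2.0] * 4] * 3, 8, 8, 4)
large = unify_observation([[1.0] * 4] * 5, [[2.0] * 4] * 6, 8, 8, 4)
```

Because both vectors have the same length, one policy network can in principle
consume observations from either scenario, which is the property that makes
transferring learned weights between scenarios possible.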

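We introduce a novel framework that enables transfer learning for multi-agent reinforcement learning by unifying various state spaces into fixed-size inputs, so that a single unified deep-learning policy can be used across different scenarios within a MAS. In the StarCraft Multi-Agent Challenge (SMAC) environment, using maneuvering skills learned in other scenarios, our approach achieves significant improvements in multi-agent learning performance compared with agents that learn from scratch. In addition, by adopting Curriculum Transfer Learning (CTL), our deep-learning policy progressively acquires knowledge and skills across pre-designed homogeneous learning scenarios, which promotes knowledge transfer between and within agents and leads to high multi-agent learning performance in more complex heterogeneous scenarios.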

Enabling Multi-Agent Transfer Reinforcement Learning Based on Scenario-Independent Representations

Enabling Multi-Agent Transfer Reinforcement Learning via Scenario Independent Representation

The extensive use of deep reinforcement learning (DRL) policy networks in
diverse continuous control tasks has raised questions regarding performance
degradation in expansive state spaces where the input state norm is larger than
that in the training environment. This paper aims to uncover the underlying
factors contributing to such performance deterioration when dealing with
expanded state spaces, using a novel analysis technique known as state
division. In contrast to prior approaches that employ state division merely as
a post-hoc explanatory tool, our methodology delves into the intrinsic
characteristics of DRL policy networks. Specifically, we demonstrate that
expanding the state space causes the $\tanh$ activation function to saturate,
which transforms the state division boundary from nonlinear to linear. Our
analysis centers on the paradigm of the
double-integrator system, revealing that this gradual shift towards linearity
imparts a control behavior reminiscent of bang-bang control. However, the
inherent linearity of the division boundary prevents the attainment of an ideal
bang-bang control, thereby introducing unavoidable overshooting. Our
experimental investigations, employing diverse RL algorithms, establish that
this performance phenomenon stems from inherent attributes of the DRL policy
network, remaining consistent across various optimization algorithms.
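
The overshoot argument can be illustrated with a small simulation. This is a
hedged sketch, not the authors' experiment: it assumes a double integrator
x' = v, v' = u with |u| <= 1, compares the standard time-optimal switching
curve x + v|v|/2 = 0 with a linear switching line x + k v = 0, and uses a sign
function as the idealized limit of a saturated tanh unit. The function names,
the slope k = 1, and the initial state are hypothetical choices.

```python
import math


def sign(z: float) -> float:
    """Return -1.0, 0.0 or 1.0 according to the sign of z."""
    return math.copysign(1.0, z) if z != 0 else 0.0


def ideal_bang_bang(x: float, v: float) -> float:
    """Time-optimal law for |u| <= 1: switch on the curve x + v|v|/2 = 0."""
    return -sign(x + 0.5 * v * abs(v))


def linear_boundary(x: float, v: float, k: float = 1.0) -> float:
    """Bang-bang law whose switching boundary is the line x + k*v = 0."""
    return -sign(x + k * v)


def simulate(x0: float, controller, dt: float = 1e-3, t_end: float = 20.0) -> float:
    """Integrate x' = v, v' = u from (x0, 0) and return the smallest x reached."""
    x, v = x0, 0.0
    x_min = x
    for _ in range(int(t_end / dt)):
        u = controller(x, v)
        v += u * dt  # semi-implicit Euler step
        x += v * dt
        x_min = min(x_min, x)
    return x_min


# Start far from the origin, i.e. in an "expanded" part of the state space.
min_x_linear = simulate(10.0, linear_boundary)
min_x_ideal = simulate(10.0, ideal_bang_bang)
```

Starting from x = 10 at rest, the linear boundary switches too late and the
position overshoots the origin by roughly 2.8 units, while the ideal curve
brings the state to the origin with only a small numerical error. For small
initial states the linear law slides along its boundary without overshoot,
which is consistent with the degradation appearing only in expanded state
spaces.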

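The extensive use of deep reinforcement learning (DRL) policy networks in a variety of continuous control tasks has raised questions about performance degradation in expanded state spaces, where the input state norm is larger than in the training environment. This paper uses a novel analysis technique called state division to uncover the underlying factors behind this degradation when handling expanded state spaces. Compared with earlier approaches that used state division only as a post-hoc explanatory tool, our method examines the intrinsic characteristics of DRL policy networks. Specifically, we show that expanding the state space causes the tanh activation function to saturate, which turns the state division boundary from nonlinear to linear. Our analysis centers on the double-integrator system and reveals that this gradual shift toward linearity produces control behavior resembling bang-bang control. However, the inherent linearity of the division boundary prevents ideal bang-bang control from being achieved, which introduces unavoidable overshoot. Our experiments, which use a variety of reinforcement learning algorithms, establish that this phenomenon stems from inherent properties of the DRL policy network and is consistent across optimization algorithms.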

Generalization Analysis of Policy Networks: The Case of the Double Integrator

Generalization Analysis of Policy Networks: An Example of Double-Integrator

Obtaining first-order regret bounds -- regret bounds scaling not as the
worst-case but with some measure of the performance of the optimal policy on a
given instance -- is a core question in sequential decision-making. While such
bounds exist in many settings, they have proven elusive in reinforcement
learning with large state spaces. In this work we address this gap, and show
that it is possible to obtain regret scaling as
$\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} +
d^{3.5} H^3 \log K)$ in reinforcement learning with large state spaces, namely
the linear MDP setting. Here $d$ is the feature dimension, $H$ is the episode
horizon, $V_1^\star$ is the value of the optimal policy, and $K$ is the number
of episodes. We demonstrate that existing techniques based on
least squares estimation are insufficient to obtain this result, and instead
develop a novel robust self-normalized concentration bound based on the robust
Catoni mean estimator, which may be of independent interest.
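
For readers unfamiliar with the Catoni estimator, the following is a minimal
sketch of the classical scalar version: the estimate is the root of a sum of
bounded-growth influence terms, found here by bisection. It does not reproduce
the paper's self-normalized bound, and the scale parameter alpha, the tolerance,
and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def catoni_psi(x: float) -> float:
    """Catoni's influence function: sign(x) * log(1 + |x| + x^2 / 2)."""
    return math.copysign(math.log1p(abs(x) + 0.5 * x * x), x)


def catoni_mean(samples: Sequence[float], alpha: float, tol: float = 1e-10) -> float:
    """Return theta solving sum_i psi(alpha * (x_i - theta)) = 0.

    The score is decreasing in theta and changes sign between the smallest
    and largest sample, so bisection on that interval finds the root.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if alpha <= 0:
        raise ValueError("alpha must be positive")

    def score(theta: float) -> float:
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)

    lo, hi = min(samples), max(samples)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return 0.5 * (lo + hi)


# One large outlier shifts the empirical mean far more than the Catoni estimate.
data = [1.0] * 99 + [1000.0]
empirical = sum(data) / len(data)
robust = catoni_mean(data, alpha=0.1)
```

Because the influence function grows only logarithmically, a single extreme
sample has limited pull on the estimate, whereas a least-squares (empirical
mean) estimate is dragged toward it. This is the kind of robustness the
abstract contrasts with least squares estimation.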

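Building on the robust Catoni mean estimator, this work proposes a novel robust self-normalized concentration bound, addresses the inability of existing techniques to obtain first-order regret bounds in reinforcement learning with large state spaces, and shows that in the linear MDP setting one can obtain a regret bound that scales with a measure of the optimal policy's performance.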