The advent of deep learning (DL) gave rise to significant breakthroughs in
Reinforcement Learning (RL) research. Deep Reinforcement Learning (DRL)
algorithms have reached super-human skill levels on vision-based
control problems such as Atari 2600 games, where environment states are
extracted from pixel information. Unfortunately, these environments are far
from applicable to highly dynamic and complex real-world tasks such as
autonomous control of a fighter aircraft, since they involve only a
2D representation of a visual world. Here, we present a semi-realistic flight
simulation environment for fighter aircraft, Harfang3D Dog-Fight Sandbox. It is
intended as a flexible toolbox for investigating the main challenges in
aviation studies using Reinforcement Learning. The program provides easy access
to the flight dynamics model, environment states, and aerodynamics of the plane,
enabling users to customize any specific task in order to build intelligent
decision-making (control) systems via RL. The software also allows deployment
of bot aircraft and development of multi-agent tasks. This way, multiple
groups of aircraft can be configured as competitive or cooperative agents
to perform complicated tasks, including Dog Fight. During the experiments, we
carried out training for two different scenarios: navigating to a designated
location and within-visual-range (WVR) combat, or Dog Fight for short. Using Deep
Reinforcement Learning techniques for both scenarios, we were able to train
competent agents that exhibit human-like behaviours. These results confirm
that Harfang3D Dog-Fight Sandbox can be utilized as a realistic 3D
RL research platform.

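This paper introduces Harfang3D Dog-Fight Sandbox, a semi-realistic flight simulation environment that provides a flexible toolbox for studying the main challenges of controlling aircraft with reinforcement learning. Agents can be trained in it with deep reinforcement learning techniques, yielding intelligent agents that exhibit human-like behaviour.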

Harfang3D Dog-Fight Sandbox: A Reinforcement Learning Research Platform for the Customized Control Tasks of Fighter Aircrafts

Deep reinforcement learning methods have shown great performance on many
challenging cooperative multi-agent tasks. Two promising research
directions are multi-agent value function decomposition and multi-agent policy
gradients. In this paper, we propose a new decomposed multi-agent soft
actor-critic (mSAC) method, which effectively combines the advantages of the
two aforementioned methods. The main modules include a decomposed Q-network
architecture, a discrete probabilistic policy, and an optional counterfactual
advantage function. Theoretically, mSAC supports efficient off-policy learning
and partially addresses the credit assignment problem in both discrete and
continuous action spaces. On the StarCraft II micromanagement cooperative
multi-agent benchmark, we empirically investigate the performance of mSAC
against its variants and analyze the effects of the different components.
Experimental results demonstrate that mSAC significantly outperforms the
policy-based approach COMA and achieves results competitive with the
state-of-the-art value-based approach Qmix on most tasks in terms of asymptotic
performance. In addition, mSAC achieves good results on large action space
tasks, such as 2c_vs_64zg and MMM2.
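
To make the idea of a decomposed soft value concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent has its own Q-values and a discrete policy, computes a per-agent soft value (expected Q plus an entropy bonus weighted by a temperature `alpha`), and combines the per-agent values with a non-negative weighted sum. The abstract does not state how mSAC mixes per-agent values, so `mix_values` is a hypothetical stand-in; the function names and toy numbers are likewise assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into a discrete probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_value(q_values, probs, alpha):
    """Expected Q under a discrete policy plus an alpha-weighted entropy bonus."""
    return sum(
        p * (q - alpha * math.log(p))
        for q, p in zip(q_values, probs)
        if p > 0.0
    )


def mix_values(agent_values, weights=None):
    """Hypothetical mixer: non-negative weighted sum of per-agent values."""
    if weights is None:
        weights = [1.0] * len(agent_values)
    if len(weights) != len(agent_values):
        raise ValueError("one weight per agent is required")
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(w * v for w, v in zip(weights, agent_values))


# Two agents with two actions each (toy numbers, not taken from the paper).
agent_q = [[1.0, 1.0], [2.0, 0.0]]
agent_pi = [softmax([0.0, 0.0]), softmax([0.0, 0.0])]
per_agent = [soft_value(q, pi, alpha=0.0) for q, pi in zip(agent_q, agent_pi)]
joint = mix_values(per_agent)
```

Keeping the mixing weights non-negative means that raising any single agent's value never lowers the joint value, which is the usual motivation for this kind of decomposition.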

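This paper proposes a new decomposed multi-agent soft actor-critic (mSAC) method that achieves efficient and strong performance on the StarCraft II micromanagement cooperative multi-agent benchmark.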

Decomposed Soft Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning

Learning when to communicate, and doing so effectively, is essential in
multi-agent tasks. Recent work shows that continuous communication allows
efficient training with back-propagation in multi-agent scenarios, but it has
been restricted to fully cooperative tasks. In this paper, we present the
Individualized Controlled Continuous Communication Model (IC3Net), which has
better training efficiency than a simple continuous communication model and can
be applied to semi-cooperative and competitive settings as well as
cooperative settings. IC3Net controls continuous communication with a gating
mechanism and uses individualized rewards for each agent to gain better
performance and scalability while fixing credit assignment issues. Using a
variety of tasks, including StarCraft BroodWars explore and combat scenarios, we
show that our network yields better performance and convergence rates than
the baselines as the scale increases. Our results indicate that IC3Net agents
learn when to communicate based on the scenario and profitability.
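
The gating idea can be illustrated with a small sketch, offered under stated assumptions and not as the paper's implementation. It assumes each agent holds a hidden-state vector and a hard 0/1 gate deciding whether it broadcasts, and that each agent receives the average of the gated hidden states of all other agents. The function name, the averaging rule, and the toy numbers are assumptions made for illustration; in a trained model the gates would come from a learned policy.

```python
def gated_communication(hidden_states, gates):
    """For each agent, average the gated hidden states of all other agents.

    hidden_states: list of equal-length float lists, one per agent.
    gates: list of 0/1 values; a 0 means that agent stays silent.
    """
    n = len(hidden_states)
    if len(gates) != n:
        raise ValueError("one gate per agent is required")
    dim = len(hidden_states[0]) if n else 0
    messages = []
    for i in range(n):
        total = [0.0] * dim
        for j in range(n):
            if j == i:
                continue  # an agent does not receive its own message
            for k in range(dim):
                total[k] += gates[j] * hidden_states[j][k]
        # Normalise by the number of other agents, silent or not.
        if n > 1:
            total = [value / (n - 1) for value in total]
        messages.append(total)
    return messages


# Three agents; the second one keeps its gate closed (toy numbers).
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gates = [1, 0, 1]
received = gated_communication(hidden, gates)
```

Because a closed gate contributes a zero vector, an agent in a competitive setting can withhold its state from the others without changing the shape of what they receive.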

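This paper proposes the Individualized Controlled Continuous Communication Model (IC3Net) for cooperative, semi-cooperative, and competitive multi-agent settings. It controls continuous communication with a gating mechanism and uses individualized rewards to improve performance and scalability while fixing credit assignment issues. Experimental results confirm that IC3Net achieves better training efficiency and convergence rates than the baseline networks across scenarios, and that agents learn when to communicate based on the scenario and profitability.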