Modeling of real-world biological multi-agents is a fundamental problem in
various scientific and engineering fields. Reinforcement learning (RL) is a
powerful framework to generate flexible and diverse behaviors in cyberspace;
however, when modeling real-world biological multi-agents, there is a domain
gap between behaviors in the source (i.e., real-world data) and the target
(i.e., cyberspace for RL), and the source environment parameters are usually
unknown. In this paper, we propose a method for adaptive action supervision in
RL from real-world demonstrations in multi-agent scenarios. We adopt an
approach that combines RL and supervised learning by selecting actions of
demonstrations in RL based on the minimum distance of dynamic time warping for
utilizing the information of the unknown source dynamics. This approach can be
easily applied to many existing neural network architectures and provide us
with an RL model balanced between reproducibility as imitation and
generalization ability to obtain rewards in cyberspace. In the experiments,
using chase-and-escape and football tasks with the different dynamics between
the unknown source and target environments, we show that our approach achieved
a balance between the reproducibility and the generalization ability compared
with the baselines. In particular, we used the tracking data of professional
football players as expert demonstrations in football and show successful
performances despite the larger gap between behaviors in the source and target
environments than the chase-and-escape task.

本文提出了一种自适应动作监督的 RL 方法，通过动态时间规整的最小距离选择 RL 真实世界演示中的动作，使得 RL 模型能够在网络空间获得回报