Reinforcement learning (RL) can enable sequential decision-making in complex and high-dimensional environments if the acquisition of a new state-action pair is efficient, i.e., when interaction with the environment is inexpensive. However, there are a myriad of real-world applications in which a high number of interactions are infeasible. In these environments, transfer RL algorithms, which can be used for the transfer of knowledge from one or multiple source environments to a target environment, have been shown to increase learning speed and improve initial and asymptotic performance. However, most existing transfer RL algorithms are on-policy and sample inefficient, and often require heuristic choices in algorithm design. This paper proposes an off-policy Advantage-based Policy Transfer algorithm, APT-RL, for fixed domain environments. Its novelty is in using the popular notion of ``advantage'' as a regularizer, to weigh the knowledge that should be transferred from the source, relative to new knowledge learned in the target, removing the need for heuristic choices. Further, we propose a new transfer performance metric to evaluate the performance of our algorithm and unify existing transfer RL frameworks. Finally, we present a scalable, theoretically-backed task similarity measurement algorithm to illustrate the alignments between our proposed transferability metric and similarities between source and target environments. Numerical experiments on three continuous control benchmark tasks demonstrate that APT-RL outperforms existing transfer RL algorithms on most tasks, and is $10\%$ to $75\%$ more sample efficient than learning from scratch.

本文提出了一种基于收益的策略转移算法 APT-RL，用于在固定领域环境中的强化学习，通过使用“优势”作为正则项，避免了启发式选择算法设计，并提出了一种新的转移性能度量来评估算法的性能并统一现有的转移强化学习框架，实验证明在大多数任务上 APT-RL 的性能优于现有的转移强化学习算法，并且比从零开始学习更加高效。

一种基于优势的强化学习策略迁移算法及其可迁移性度量