Transferring knowledge in cross-domain reinforcement learning is a
challenging setting in which learning is accelerated by reusing knowledge from
a task with a different observation and/or action space. However, it is often
necessary to carefully select the source of knowledge for the receiving end to
benefit from the transfer process. In this article, we study how to measure the
similarity between cross-domain reinforcement learning tasks to select a source
of knowledge that will improve the performance of the learning agent. We
develop a semi-supervised alignment loss to match different spaces with a set
of encoder-decoders, and use them to measure similarity and transfer policies
across tasks. In comparison to prior works, our method does not require data to
be aligned, paired, or collected by expert policies. Experimental results on a
set of varied MuJoCo control tasks show the robustness of our method in
effectively selecting and transferring knowledge, without the supervision of a
tailored set of source tasks.

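By developing a semi-supervised alignment loss that matches different spaces with a set of encoder-decoders, this work studies how to measure the similarity between cross-domain reinforcement learning tasks in order to select a source of knowledge that improves the learning agent's performance. Experimental results on a variety of MuJoCo control tasks show that the method selects and transfers knowledge effectively, without requiring data that is aligned, paired, or collected by expert policies.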

Similarity-based Knowledge Transfer for Cross-Domain Reinforcement Learning

Training an agent to solve control tasks directly from high-dimensional
images with model-free reinforcement learning (RL) has proven difficult. A
promising approach is to learn a latent representation together with the
control policy. However, fitting a high-capacity encoder using a scarce reward
signal is sample inefficient and leads to poor performance. Prior work has
shown that auxiliary losses, such as image reconstruction, can aid efficient
representation learning. However, incorporating reconstruction loss into an
off-policy learning algorithm often leads to training instability. We explore
the underlying reasons and identify variational autoencoders, used by previous
investigations, as the cause of the divergence. Following these findings, we
propose effective techniques to improve training stability. This results in a
simple approach capable of matching state-of-the-art model-free and model-based
algorithms on MuJoCo control tasks. Furthermore, our approach demonstrates
robustness to observational noise, surpassing existing approaches in this
setting. Code, results, and videos are anonymously available at
this https URL

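By introducing an auxiliary loss and addressing the training instability it causes, this work proposes a simple and effective approach that matches state-of-the-art model-free and model-based algorithms on MuJoCo control tasks, is robust to observational noise, and overcomes the divergence previously encountered when using variational autoencoders.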