Policy distillation, which transfers a teacher policy to a student policy, has
achieved great success in challenging deep reinforcement learning tasks.
This teacher-student framework requires a well-trained teacher model, which is
computationally expensive to obtain. Moreover, the performance of the student
model could be limited by the teacher model if the teacher model is not optimal. In the
light of collaborative learning, we study the feasibility of involving joint
intellectual efforts from diverse perspectives of student models. In this work,
we introduce dual policy distillation (DPD), a student-student framework in
which two learners operate on the same environment to explore different
perspectives of the environment and extract knowledge from each other to
enhance their learning. The key challenge in developing this dual learning
framework is to identify the beneficial knowledge from the peer learner for
contemporary learning-based reinforcement learning algorithms, since it is
unclear whether the knowledge distilled from an imperfect and noisy peer
learner would be helpful. To address the challenge, we theoretically justify
that distilling knowledge from a peer learner will lead to policy improvement
and propose a disadvantageous distillation strategy based on the theoretical
results. Experiments on several continuous control tasks show
that the proposed framework achieves superior performance with a learning-based
agent and function approximation, without the use of expensive teacher models.

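This paper proposes a dual-learner framework called dual policy distillation (DPD), in which two learners operate in the same environment to explore different aspects of it and extract knowledge from each other to enhance their learning. Experiments on several continuous control tasks show that the framework achieves superior performance with a learning-based agent and function approximation, without using expensive teacher models.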
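
To make the idea of distilling only beneficial knowledge from a peer more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not spell out the selection criterion, so the sketch assumes that a learner imitates its peer only on states where the peer's value estimate is higher than its own (the states where the learner is "at a disadvantage"), and that the distillation distance is a squared difference between deterministic actions. The names `disadvantage_mask` and `masked_distillation_loss` are hypothetical.

```python
from typing import List, Sequence


def disadvantage_mask(own_values: Sequence[float],
                      peer_values: Sequence[float]) -> List[float]:
    """Return 1.0 for states where the peer's value estimate exceeds ours.

    Assumption: value estimates stand in for "how well each learner
    does from this state"; the abstract does not fix this choice.
    """
    return [1.0 if v_peer > v_own else 0.0
            for v_own, v_peer in zip(own_values, peer_values)]


def masked_distillation_loss(own_actions: Sequence[Sequence[float]],
                             peer_actions: Sequence[Sequence[float]],
                             own_values: Sequence[float],
                             peer_values: Sequence[float]) -> float:
    """Squared action distance to the peer, counted only on masked states.

    The sum is divided by the full batch size, so states where the peer
    is not better contribute zero to the loss.
    """
    mask = disadvantage_mask(own_values, peer_values)
    total = 0.0
    for m, a_own, a_peer in zip(mask, own_actions, peer_actions):
        total += m * sum((x - y) ** 2 for x, y in zip(a_own, a_peer))
    return total / max(len(mask), 1)


# Toy batch of three states with two-dimensional actions (made-up numbers).
own_actions = [[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]
peer_actions = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
own_values = [1.0, 3.0, 0.0]
peer_values = [2.0, 1.0, 0.5]

loss = masked_distillation_loss(own_actions, peer_actions,
                                own_values, peer_values)
```

In a dual setup each learner would evaluate this term with the roles of "own" and "peer" swapped, and add it to its usual reinforcement learning objective. In the toy batch, the second state is ignored because the learner's own value estimate is already higher there.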