Deep reinforcement learning (DRL) has led to a wide range of advances in
sequential decision-making tasks. However, the complexity of neural network
policies makes them difficult to understand and to deploy with limited
computational resources. Employing compact symbolic expressions as symbolic
policies is a promising strategy to obtain simple and interpretable policies.
Previous symbolic policy methods usually involve complex training processes and
pre-trained neural network policies, which are inefficient and limit the
application of symbolic policies. In this paper, we propose an efficient
gradient-based learning method named Efficient Symbolic Policy Learning (ESPL)
that learns the symbolic policy from scratch in an end-to-end way. We introduce
a symbolic network as the search space and employ a path selector to find the
compact symbolic policy. By doing so, we represent the policy with a
differentiable symbolic expression and train it in an off-policy manner, which
further improves efficiency. In addition, in contrast with previous
symbolic policies, which only work in single-task RL because of complexity, we
extend ESPL to meta-RL to generate symbolic policies for unseen tasks.
Experimentally, we show that our approach generates symbolic policies with
higher performance and greatly improves data efficiency for single-task RL. In
meta-RL, we demonstrate that, compared with neural network policies, the
proposed symbolic policy achieves higher performance and efficiency and shows
the potential to be interpretable.

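Summary: By proposing an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL), this work learns symbolic policies from scratch in deep reinforcement learning and extends the approach to meta-reinforcement learning, generating symbolic policies with higher performance, higher efficiency, and the potential to be interpretable.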
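
To make the idea of a symbolic network with a path selector more concrete, the following is a minimal illustrative sketch, not the ESPL implementation. It assumes a single symbolic layer, a small hypothetical operator library (`OPERATORS`), hand-picked weights, and a fixed threshold that stands in for a learned path selector; all names (`select_paths`, `symbolic_policy`, `to_expression`) are introduced here for illustration only. The sketch shows how masking out paths leaves a compact expression that can be read directly.

```python
import math

# Hypothetical operator library; the operator set used by ESPL may differ.
OPERATORS = [
    ("id", lambda z: z),
    ("sin", math.sin),
    ("cos", math.cos),
    ("square", lambda z: z * z),
]


def select_paths(probs, threshold=0.5):
    """Threshold selection probabilities into a hard 0/1 mask.

    Assumption: a fixed threshold replaces the learned selector.
    """
    return [1.0 if p > threshold else 0.0 for p in probs]


def symbolic_policy(state, in_weights, in_masks, out_weights, out_mask):
    """Evaluate the masked one-layer symbolic network on a state vector."""
    action = 0.0
    for k, (_, op) in enumerate(OPERATORS):
        # Masked linear combination of the state feeds operator k.
        z = sum(w * m * s for w, m, s in zip(in_weights[k], in_masks[k], state))
        action += out_weights[k] * out_mask[k] * op(z)
    return action


def to_expression(in_weights, in_masks, out_weights, out_mask):
    """Render only the selected paths as a human-readable expression."""
    terms = []
    for k, (name, _) in enumerate(OPERATORS):
        if out_mask[k] == 0.0:
            continue  # this operator was pruned by the selector
        inner = [
            f"{w:g}*s{i}"
            for i, (w, m) in enumerate(zip(in_weights[k], in_masks[k]))
            if m != 0.0
        ]
        terms.append(f"{out_weights[k]:g}*{name}({' + '.join(inner) or '0'})")
    return " + ".join(terms) or "0"


# Hand-picked (not learned) parameters for a 2-dimensional state.
in_weights = [[1.0, 0.5], [2.0, 0.0], [0.3, 0.3], [0.0, 1.0]]
in_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.1], [0.3, 0.4]]
out_weights = [0.5, -1.0, 0.7, 0.2]
out_probs = [0.9, 0.7, 0.2, 0.1]

in_masks = [select_paths(row) for row in in_probs]
out_mask = select_paths(out_probs)

expression = to_expression(in_weights, in_masks, out_weights, out_mask)
action = symbolic_policy([0.3, -1.2], in_weights, in_masks, out_weights, out_mask)
print(expression)  # 0.5*id(1*s0) + -1*sin(2*s0)
print(action)
```

In this sketch the mask is computed once by thresholding. A gradient-based method would instead need a differentiable relaxation of the selection so that weights and selector can be trained together; that part is omitted here.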