The search for interpretable reinforcement learning policies is of high
academic and industrial interest. Especially for industrial systems, domain
experts are more likely to deploy autonomously learned controllers if they are
understandable and convenient to evaluate. Basic algebraic equations are
expected to meet these requirements, provided their complexity is restricted
to an adequate level. Here we introduce the genetic programming for
reinforcement learning (GPRL) approach based on model-based batch reinforcement
learning and genetic programming, which autonomously learns policy equations
from pre-existing default state-action trajectory samples. GPRL is compared to
a straightforward method that uses genetic programming for symbolic
regression, yielding policies that imitate an existing well-performing but
non-interpretable policy. Experiments on three reinforcement learning
benchmarks, i.e., mountain car, cart-pole balancing, and the industrial
benchmark, demonstrate the superiority of our GPRL approach over the symbolic
regression method. GPRL is capable of producing well-performing interpretable
reinforcement learning policies from pre-existing default trajectory data.
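
To make the idea of scoring a policy equation against a world model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-written toy model (`toy_model`) standing in for a model that would be learned from the batch of trajectory samples, and it compares two hypothetical candidate equations (`policy_a`, `policy_b`) by their average return over model rollouts. All names, constants, and dynamics here are illustrative assumptions; in a full approach the candidates would be generated and varied by genetic programming rather than written by hand.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]            # (position, velocity); illustrative only
Policy = Callable[[State], float]      # maps a state to a scalar action
Model = Callable[[State, float], Tuple[State, float]]  # -> (next_state, reward)


def toy_model(state: State, action: float) -> Tuple[State, float]:
    """Hypothetical stand-in for a world model learned from batch data."""
    pos, vel = state
    action = max(-1.0, min(1.0, action))   # clip the action to [-1, 1]
    new_vel = vel + 0.1 * action
    new_pos = pos + 0.1 * new_vel
    reward = -(new_pos ** 2)               # reward is highest at the origin
    return (new_pos, new_vel), reward


def model_based_fitness(policy: Policy, model: Model,
                        start_states: Sequence[State],
                        horizon: int = 50) -> float:
    """Average undiscounted return of `policy` over rollouts in `model`."""
    total = 0.0
    for state in start_states:
        for _ in range(horizon):
            state, reward = model(state, policy(state))
            total += reward
    return total / len(start_states)


# Two hand-written candidate equations (assumed for illustration).
def policy_a(s: State) -> float:
    return -2.0 * s[0] - 3.0 * s[1]        # a = -2*position - 3*velocity


def policy_b(s: State) -> float:
    return 0.0                             # a = 0 (do nothing)


starts = [(1.0, 0.0), (-0.5, 0.2)]
fitness_a = model_based_fitness(policy_a, toy_model, starts)
fitness_b = model_based_fitness(policy_b, toy_model, starts)
best = max([policy_a, policy_b],
           key=lambda p: model_based_fitness(p, toy_model, starts))
print(best.__name__, fitness_a, fitness_b)
```

Because fitness is computed only from model rollouts, no interaction with the real system is needed while candidates are being compared, which is the property that makes a batch setting workable.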

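Building on model-based batch reinforcement learning and genetic programming, we introduce the GPRL method, which autonomously learns policy equations from existing default state-action trajectory samples. Experimental results show that, compared with the symbolic regression method, GPRL can produce well-performing, interpretable reinforcement learning policies from existing default trajectory data.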