This paper presents a novel RL algorithm, S-REINFORCE, which is designed to
generate interpretable policies for dynamic decision-making tasks. The proposed
algorithm leverages two types of function approximators, namely a neural network
(NN) and a symbolic regressor (SR), to produce numerical and symbolic policies,
respectively. The NN component learns to generate a numerical probability
distribution over the possible actions using a policy gradient, while the SR
component captures the functional form that relates the states to the action
probabilities. The SR-generated policy expressions are then used through
importance sampling to improve the rewards received during the learning
process. We tested S-REINFORCE on several dynamic decision-making problems
with low- and high-dimensional action spaces, and the results demonstrate its
effectiveness in producing interpretable solutions. By combining the strengths
of the NN and the SR, S-REINFORCE produces policies that both perform well and
are easy to interpret, making it well suited to real-world applications where
transparency and causality are crucial.
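
The interplay described above, in which actions drawn from a symbolic expression feed an importance-weighted policy-gradient update, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the S-REINFORCE implementation: a linear-softmax policy stands in for the NN, `symbolic_policy` is a hand-written hypothetical expression standing in for an SR fit, and the one-step toy task, its reward, and all function names are invented for illustration.

```python
import math
import random


def softmax_policy(theta, s):
    """Numerical policy over 2 actions for a scalar state s.
    Assumption: a linear-softmax model stands in for the NN."""
    logits = [theta[0] * s, theta[1] * s]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def symbolic_policy(s):
    """Hypothetical closed-form expression standing in for an SR fit."""
    p1 = 1.0 / (1.0 + math.exp(-2.0 * s))
    p1 = min(max(p1, 0.05), 0.95)  # floor keeps importance weights bounded
    return [1.0 - p1, p1]


def reward(s, a):
    """Toy reward: action 1 is correct for positive states, action 0 otherwise."""
    return 1.0 if (a == 1) == (s > 0) else 0.0


def is_reinforce_step(theta, s, rng, lr=0.1):
    """One importance-weighted REINFORCE update.
    The action is sampled from the symbolic (behaviour) policy."""
    behaviour = symbolic_policy(s)
    a = 0 if rng.random() < behaviour[0] else 1
    target = softmax_policy(theta, s)
    w = target[a] / behaviour[a]  # importance weight pi_theta(a|s) / pi_sym(a|s)
    g = reward(s, a)              # one-step episode, so the return is the reward
    # d/d theta_k of log pi_theta(a|s) = (1[k == a] - pi_k) * s
    for k in range(2):
        indicator = 1.0 if k == a else 0.0
        theta[k] += lr * w * g * (indicator - target[k]) * s
    return w


def train(steps=2000, seed=0):
    """Run the toy training loop and return the learned parameters."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)
        is_reinforce_step(theta, s, rng)
    return theta
```

Calling `train()` returns the learned parameters. In this toy task only rewarded actions produce a nonzero update, so `theta[1]` ends up above `theta[0]` and the softmax policy prefers action 1 for positive states. The probability floor in `symbolic_policy` is a design choice of this sketch: it keeps every action reachable under the behaviour policy, which importance sampling requires.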

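This study proposes a new reinforcement learning algorithm, S-REINFORCE, which aims to produce interpretable policies for dynamic decision-making tasks. The algorithm uses two types of function approximators, a neural network (NN) and a symbolic regressor (SR), to generate numerical and symbolic policies, respectively: the NN component learns a numerical probability distribution over the possible actions, while the SR component captures the functional form relating the associated states to the action probabilities, and the two are combined to solve the decision problem. Experimental results show that S-REINFORCE is effective on dynamic decision-making problems with both low- and high-dimensional decision spaces, and that the resulting policies not only perform well but are also easy to understand, making the algorithm well suited to practical applications where transparency and causality are crucial.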