Programmatically Interpretable Reinforcement Learning (PIRL) encodes policies
in human-readable computer programs. Novel algorithms were recently introduced
with the goal of handling the lack of gradient signal to guide the search in
the space of programmatic policies. Most such PIRL algorithms first train a
neural policy that is used as an oracle to guide the search in the programmatic
space. In this paper, we show that, depending on the language used to encode
the programmatic policies, such PIRL-specific algorithms are not needed,
because one can use actor-critic algorithms to obtain a programmatic policy
directly. We use a connection between ReLU neural networks and
oblique decision trees to translate the policy learned with actor-critic
algorithms into programmatic policies. This translation from ReLU networks
allows us to synthesize policies encoded in programs with if-then-else
structures, linear transformations of the input values, and PID operations.
Empirical results on several control problems show that this translation
approach is capable of learning short and effective policies. Moreover, the
translated policies are at least competitive with, and often far superior to,
the policies that PIRL algorithms synthesize.
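
The connection between ReLU networks and oblique decision trees can be
illustrated with a minimal sketch. It is not the paper's translation procedure:
it assumes a network with a single hidden layer and a scalar output, and the
names `relu_net`, `net_to_tree`, and `tree_policy`, as well as the weights, are
hypothetical. Each hidden neuron becomes one if-then-else test on a linear
function of the input, and each leaf holds a linear transformation of the input.

```python
from itertools import product


def relu_net(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer ReLU network with a scalar output."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(v * h for v, h in zip(W2, hidden)) + b2


def net_to_tree(W1, b1, W2, b2):
    """Map each activation pattern (one root-to-leaf path) to a linear leaf.

    A pattern is a tuple with one 0/1 entry per hidden neuron. The leaf is the
    (weights, bias) pair the network reduces to when exactly the neurons
    marked 1 are active.
    """
    n_in = len(W1[0])
    leaves = {}
    for pattern in product((0, 1), repeat=len(W1)):
        weights = [
            sum(a * v * row[j] for a, v, row in zip(pattern, W2, W1))
            for j in range(n_in)
        ]
        bias = b2 + sum(a * v * b for a, v, b in zip(pattern, W2, b1))
        leaves[pattern] = (weights, bias)
    return leaves


def tree_policy(x, W1, b1, leaves):
    """Evaluate the tree: every inner node tests one condition w.x + b > 0."""
    pattern = tuple(
        int(sum(w * xi for w, xi in zip(row, x)) + b > 0)
        for row, b in zip(W1, b1)
    )
    weights, bias = leaves[pattern]
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical weights for a network with two inputs and two hidden neurons.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, -1.0]
W2 = [2.0, -1.0]
b2 = 0.5
leaves = net_to_tree(W1, b1, W2, b2)
```

In this sketch the tree has one leaf per activation pattern, so its size grows
exponentially with the number of hidden neurons. This suggests why short
programs would call for small networks.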

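This paper shows that a policy learned with actor-critic algorithms can be translated into a policy encoded as a program, which avoids the need for PIRL-specific algorithms. Empirical results show that this translation approach learns short and effective policies, and that the translated policies are at least competitive with, and often better than, those of PIRL algorithms.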

Synthesizing Programmatic Policies with Actor-Critic Algorithms and ReLU Networks

Reinforcement learning (RL) agents improve through trial and error, but when
reward is sparse and the agent cannot discover successful action sequences,
learning stagnates. This has been a notable problem in training deep RL agents
to perform web-based tasks, such as booking flights or replying to emails,
where a single mistake can ruin the entire sequence of actions. A common remedy
is to "warm-start" the agent by pre-training it to mimic expert demonstrations,
but this is prone to overfitting. Instead, we propose to constrain exploration
using demonstrations. From each demonstration, we induce high-level "workflows"
which constrain the allowable actions at each time step to be similar to those
in the demonstration (e.g., "Step 1: click on a textbox; Step 2: enter some
text"). Our exploration policy then learns to identify successful workflows and
samples actions that satisfy these workflows. Workflows prune out bad
exploration directions and accelerate the agent's ability to discover rewards.
We use our approach to train a novel neural policy designed to handle the
semi-structured nature of websites, and evaluate on a suite of web tasks,
including the recent World of Bits benchmark. We achieve new state-of-the-art
results, and show that workflow-guided exploration improves sample efficiency
over behavioral cloning by more than 100x.
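
The idea of a workflow constraining the allowable actions at each time step can
be sketched as follows. This is an illustrative sketch, not the paper's
implementation: the action representation as `(kind, target)` pairs and the
helpers `click_on`, `type_into`, `allowed_actions`, and `sample_episode` are
hypothetical. It also omits the part where the exploration policy learns which
workflows succeed.

```python
import random


def click_on(tag):
    """Workflow step: accept only a click on an element with this tag."""
    return lambda action: action == ("click", tag)


def type_into(tag):
    """Workflow step: accept only typing into an element with this tag."""
    return lambda action: action == ("type", tag)


def allowed_actions(workflow, step, actions):
    """Keep only the actions that satisfy the workflow's constraint at `step`."""
    return [a for a in actions if workflow[step](a)]


def sample_episode(workflow, actions_per_step, rng):
    """Sample one action per time step from the workflow-constrained set."""
    episode = []
    for step, actions in enumerate(actions_per_step):
        candidates = allowed_actions(workflow, step, actions)
        if not candidates:
            raise ValueError(f"no action satisfies workflow step {step}")
        episode.append(rng.choice(candidates))
    return episode


# "Step 1: click on a textbox; Step 2: enter some text"
workflow = [click_on("textbox"), type_into("textbox")]
page_actions = [("click", "textbox"), ("click", "button"), ("type", "textbox")]
episode = sample_episode(workflow, [page_actions, page_actions], random.Random(0))
```

Here the workflow prunes the three candidate actions down to one per step, so
the sampled episode follows the demonstration's structure without copying its
exact actions.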

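A workflow-guided exploration algorithm, which uses demonstrations to constrain exploration, improves the efficiency of reinforcement learning agents on web-based tasks.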