Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.


Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning