Automated theorem proving in large theories can be learned via reinforcement learning over an indefinitely growing action space. In order to select actions, one performs nearest neighbor lookups in the knowledge base to find premises to be applied. Here we address the exploration for reinforcement learning in this space. Approaches (like epsilon-greedy strategy) that sample actions uniformly do not scale to this scenario as most actions lead to dead ends and unsuccessful proofs which are not useful for training our models. In this paper, we compare approaches that select premises using randomly initialized similarity measures and mixing them with the proposals of the learned model. We evaluate these on the HOList benchmark for tactics based higher order theorem proving. We implement an automated theorem prover named DeepHOL-Zero that does not use any of the human proofs and show that our improved exploration method manages to expand the training set continuously. DeepHOL-Zero outperforms the best theorem prover trained by imitation learning alone.

本文介绍如何在大型知识库的前提下进行自动定理证明，并通过深度强化学习技术中的基于词频-逆文档频率的查找开发出了一个混合陈述选择方法，以帮助探索并了解哪些前提适用于新的定理证明。实验表明，使用该方法进行训练的定理证明器优于仅以人类证明为基础的证明器，并可以接近于用模仿和强化学习相结合进行训练的证明器的性能。我们还通过多次实验来理解我们的探索方法背后的假设重要性，并解释了我们的设计选择。

在大型理论中学习推理，无需模仿