This paper addresses the problem of learning optimal control policies for systems with uncertain dynamics and high-level control objectives specified as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in both the workspace structure and the outcomes of control decisions, giving rise to an unknown Markov Decision Process (MDP). Existing reinforcement learning (RL) algorithms for LTL tasks typically explore the state space of a product MDP uniformly (e.g., using an $\epsilon$-greedy policy), which compromises sample efficiency. This issue becomes more pronounced as the rewards get sparser and as the MDP size or the task complexity increases. In this paper, we propose an accelerated RL algorithm that can learn control policies significantly faster than competing approaches. Its sample efficiency relies on a novel task-driven exploration strategy that biases exploration towards directions that may contribute to task satisfaction. We provide theoretical analysis and extensive comparative experiments demonstrating the sample efficiency of the proposed method. The benefit of our method becomes more evident as the task complexity or the MDP size increases.
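
To make the contrast with uniform $\epsilon$-greedy exploration concrete, the following is a minimal Python sketch of one way exploration could be biased towards a task-relevant action. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function name `biased_epsilon_greedy`, the parameter `delta_biased`, and the idea that some external procedure supplies a single `biased_action` believed to make progress on the LTL task are all hypothetical.

```python
import random


def biased_epsilon_greedy(q_values, biased_action, epsilon, delta_biased, rng=random):
    """Pick an action index with task-biased exploration (illustrative sketch).

    Assumptions (not taken from the paper):
      - q_values: list of Q-value estimates, one per action, for the current state.
      - biased_action: index of an action assumed to make progress on the task.
      - epsilon: total exploration probability.
      - delta_biased: share of epsilon spent on the biased action,
        with 0 <= delta_biased <= epsilon <= 1.
    """
    if not 0.0 <= delta_biased <= epsilon <= 1.0:
        raise ValueError("require 0 <= delta_biased <= epsilon <= 1")

    u = rng.random()  # uniform sample in [0, 1)
    if u < 1.0 - epsilon:
        # Exploit: greedy action with respect to the current Q estimates.
        return max(range(len(q_values)), key=q_values.__getitem__)
    if u < 1.0 - epsilon + delta_biased:
        # Biased exploration: take the action expected to help the task.
        return biased_action
    # Remaining probability mass: uniform random exploration.
    return rng.randrange(len(q_values))
```

Setting `delta_biased = 0` recovers plain $\epsilon$-greedy exploration, so in this sketch the bias is a strict generalization of the uniform baseline rather than a replacement for it.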

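This paper studies the problem of learning optimal control policies for systems with uncertain dynamics, where the high-level control objectives are specified by Linear Temporal Logic (LTL) formulas. It proposes an accelerated reinforcement learning algorithm that uses a novel task-driven exploration strategy to improve sample efficiency, with the benefit becoming more pronounced as the task complexity or the size of the Markov Decision Process (MDP) grows. Theoretical analysis and experiments show that the method learns significantly faster than existing competing approaches.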

Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration