We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., without resets. We propose Nonepisodic Optimistic RL (NeoRL), an approach based on the principle of optimism in the face of uncertainty. NeoRL uses well-calibrated probabilistic models and plans optimistically with respect to the epistemic uncertainty about the unknown dynamics. Under continuity and bounded energy assumptions on the system, we provide a first-of-its-kind regret bound of $\mathcal{O}(\beta_T \sqrt{T \Gamma_T})$ for general nonlinear systems with Gaussian process dynamics. We compare NeoRL to baselines on several deep RL environments and empirically demonstrate that NeoRL achieves the optimal average cost while incurring the least regret.

NeoRL: Efficient Exploration for Nonepisodic Reinforcement Learning