Most reinforcement learning algorithms with regret guarantees rely on a critical assumption: that all errors are recoverable. Recent work by Plaut et al. discarded this assumption and presented algorithms that avoid "catastrophe" (i.e., irreparable errors) by asking for help. However, they provided only safety guarantees and did not consider reward maximization. We prove that any algorithm that avoids catastrophe in their setting also guarantees high reward (i.e., sublinear regret) in any Markov Decision Process (MDP), including MDPs with irreversible costs. This constitutes the first no-regret guarantee for general MDPs. More broadly, our result may be the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient in an unknown, unbounded, and high-stakes environment without causing catastrophe or requiring resets.

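This work addresses a gap in existing reinforcement learning algorithms: when errors can be irreversible, they offer no reward-maximization guarantees. It proves that, in the setting considered, algorithms that avoid catastrophe not only ensure safety but also guarantee high reward. The result is the first no-regret guarantee for general Markov decision processes, showing that an agent can obtain high reward while becoming self-sufficient in uncertain, high-stakes environments.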

Asking for Help Enables Safety Guarantees Without Sacrificing Effectiveness