In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.

通过竞争分析的视角，我们量化分析了先见之明的未来回报信息的价值，并且得出了标准RL代理和具有部分未来回报展望的代理之间的比率。我们刻画了最坏情况下的回报分布，并得出了最坏情况下回报期望的精确比率。结果令人惊讶的是，所得比率与离线RL和无回报探索中的已知数量相关。我们还提供了给定最坏动态情况下的比率的严格界限。我们的结果涵盖了在行动之前观察即时回报到在交互开始之前观察所有回报之间的所有情况。

强化学习中奖励展望的价值