In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are $\textit{independent}$ of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose $\textit{submodular RL}$ (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions, which capture diminishing returns. Unfortunately, we show that the resulting optimization problem is in general hard to approximate, even in tabular settings. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose SubPO, a simple policy gradient-based algorithm for SubRL that handles non-additive rewards by greedily maximizing marginal gains. Indeed, under some assumptions on the underlying Markov Decision Process (MDP), SubPO recovers optimal constant-factor approximations of submodular bandits. Moreover, we derive a natural policy gradient approach for locally optimizing SubRL instances even in large state and action spaces. We showcase the versatility of our approach by applying SubPO to several applications, such as biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization. Our results demonstrate sample efficiency, as well as scalability to high-dimensional state-action spaces.
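To make the notions of diminishing returns and marginal gains concrete, the following is a minimal Python sketch on a toy coverage reward. It is not SubPO: it ignores the MDP dynamics and the policy gradient entirely and only illustrates classical greedy selection on a submodular set function. The `COVERAGE` table and the names `coverage_reward`, `marginal_gain` and `greedy_selection` are hypothetical and introduced purely for illustration.

```python
# Hypothetical toy instance: each state "covers" a set of grid cells.
COVERAGE = {
    "s0": {1, 2, 3},
    "s1": {3, 4},
    "s2": {4, 5, 6, 7},
    "s3": {1, 2},
}


def coverage_reward(visited):
    """F(S): number of distinct cells covered by the visited states."""
    covered = set()
    for state in visited:
        covered |= COVERAGE[state]
    return len(covered)


def marginal_gain(state, visited):
    """F(S + {state}) - F(S): extra reward from adding one more state."""
    visited = list(visited)
    return coverage_reward(visited + [state]) - coverage_reward(visited)


def greedy_selection(budget):
    """Pick `budget` states, each time taking the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        candidates = [s for s in COVERAGE if s not in chosen]
        best = max(candidates, key=lambda s: marginal_gain(s, chosen))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # The gain of "s1" shrinks as more overlapping states are visited.
    print(marginal_gain("s1", []))
    print(marginal_gain("s1", ["s0"]))
    print(marginal_gain("s1", ["s0", "s2"]))
    print(greedy_selection(2))
```

In this toy instance the reward of visiting `s1` depends on the history of visited states, which is exactly what an additive, Markovian reward cannot express.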

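Within reinforcement learning, we propose the SubRL paradigm, which optimizes non-additive rewards by using submodular set functions to capture diminishing returns. By greedily maximizing marginal gains, our algorithm SubPO handles non-additive rewards and recovers optimal constant-factor approximations of submodular bandits. We also introduce a natural policy gradient approach for optimizing SubRL instances in large state and action spaces. We apply SubPO to several applications, including biodiversity monitoring, Bayesian experiment design, informative path planning, and coverage maximization; the results show that our approach performs well in terms of both sample efficiency and scalability.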

Submodular Reinforcement Learning