In reinforcement learning, the objective is almost always defined as a
\emph{cumulative} function over the rewards along the process. However, there
are many optimal control and reinforcement learning problems in various
application fields, especially in communications and networking, where the
objectives are not naturally expressed as summations of the rewards. In this
paper, we recognize the prevalence of non-cumulative objectives in various
problems, and propose a modification to existing algorithms for optimizing such
objectives. Specifically, we dive into the fundamental building block for many
optimal control and reinforcement learning algorithms: the Bellman optimality
equation. To optimize a non-cumulative objective, we replace the original
summation operation in the Bellman update rule with a generalized operation
corresponding to the objective. Furthermore, we provide sufficient conditions
on the form of the generalized operation as well as assumptions on the Markov
decision process under which the globally optimal convergence of the
generalized Bellman updates can be guaranteed. We demonstrate the idea
experimentally with the bottleneck objective, i.e., an objective determined
by the minimum reward along the process, on classical optimal control and
reinforcement learning tasks, as well as on two network routing problems that
maximize flow rates.
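To make the generalized update concrete, the following is a minimal tabular sketch for the bottleneck objective, not the authors' implementation. It assumes a deterministic, undiscounted routing problem in which the reward of each step is the capacity of the link taken; the toy network `EDGES` and the helper names `bottleneck_q_iteration` and `greedy_route` are hypothetical. The only change from ordinary Q-iteration is that `min(reward, best_future)` replaces `reward + gamma * best_future`.

```python
# Hypothetical toy network: EDGES[node] = {next_node: link_capacity}.
EDGES = {
    "s": {"a": 4.0, "b": 10.0},
    "a": {"t": 8.0},
    "b": {"a": 3.0, "t": 2.0},
    "t": {},
}


def bottleneck_q_iteration(edges, goal, sweeps=50):
    """Q-iteration where `min` replaces the usual sum in the Bellman update."""
    q = {(s, nxt): 0.0 for s, nbrs in edges.items() for nxt in nbrs}
    for _ in range(sweeps):
        for s, nbrs in edges.items():
            for nxt, cap in nbrs.items():
                if nxt == goal:
                    # At the goal no further link can constrain the flow.
                    future = float("inf")
                else:
                    # Best achievable bottleneck from the next node (0 at dead ends).
                    future = max((q[(nxt, a)] for a in edges[nxt]), default=0.0)
                # Generalized update: min(r, max_a' Q(s', a')).
                q[(s, nxt)] = min(cap, future)
    return q


def greedy_route(edges, q, start, goal):
    """Follow the action with the largest bottleneck value until the goal."""
    route = [start]
    while route[-1] != goal and len(route) <= len(edges):
        node = route[-1]
        if not edges[node]:
            break  # dead end: no outgoing links
        route.append(max(edges[node], key=lambda nxt: q[(node, nxt)]))
    return route


q = bottleneck_q_iteration(EDGES, goal="t")
route = greedy_route(EDGES, q, start="s", goal="t")
print(route, q[("s", "a")])  # ['s', 'a', 't'] 4.0
```

In this toy network the link `s -> b` has the largest immediate capacity (10), but every path through `b` is limited to a bottleneck of at most 3, so the greedy policy on the converged values picks `s -> a -> t` with bottleneck 4. A sum-based objective would rank these routes differently, which is the mismatch the abstract describes.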

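For optimization problems whose objective cannot be expressed directly as a cumulative sum of rewards, this work proposes a generalized Bellman update algorithm based on the Bellman optimality condition, in which a generalized operation replaces the summation in the original Bellman update rule.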

Reinforcement Learning with Non-Cumulative Objective

When function approximation is used, solving the Bellman optimality equation
with stability guarantees has remained a major open problem in reinforcement
learning for decades. The fundamental difficulty is that the Bellman operator
may become an expansion in general, resulting in oscillating and even divergent
behavior of popular algorithms like Q-learning. In this paper, we revisit the
Bellman equation, and reformulate it into a novel primal-dual optimization
problem using Nesterov's smoothing technique and the Legendre-Fenchel
transformation. We then develop a new algorithm, called Smoothed Bellman Error
Embedding, to solve this optimization problem where any differentiable function
class may be used. We provide what we believe to be the first convergence
guarantee for general nonlinear function approximation, and analyze the
algorithm's sample complexity. Empirically, our algorithm compares favorably to
state-of-the-art baselines in several benchmark control problems.
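The abstract names two tools without spelling them out. The sketch below illustrates the standard form of each, which is an assumption and not necessarily the paper's exact formulation: smoothing the hard `max` in the Bellman equation with a log-sum-exp at temperature `lam`, and writing a squared error as a maximum over a dual variable `nu` through the Legendre-Fenchel identity x^2 = max_nu (2*nu*x - nu^2). The function names `smoothed_max` and `fenchel_lower_bound` are hypothetical.

```python
import math


def smoothed_max(values, lam):
    """Log-sum-exp smoothing of max: lam * log(sum(exp(v / lam)))."""
    m = max(values)  # subtract the max for numerical stability
    return m + lam * math.log(sum(math.exp((v - m) / lam) for v in values))


def fenchel_lower_bound(x, nu):
    """Dual form of x**2: 2*nu*x - nu**2 <= x**2, with equality at nu == x."""
    return 2.0 * nu * x - nu * nu


q_values = [1.0, 2.0, 3.0]  # hypothetical action values at one state
for lam in (1.0, 0.1, 0.01):
    # The smoothed value lies between max(q) and max(q) + lam * log(n),
    # so it approaches the hard max as lam shrinks.
    print(lam, smoothed_max(q_values, lam))

x = 1.5  # hypothetical Bellman residual
print(fenchel_lower_bound(x, nu=x) == x * x)  # True
```

The smoothing makes the operator differentiable in the action values, and the dual variable turns a squared residual into a saddle-point problem, which is the kind of primal-dual structure the abstract refers to.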

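This paper uses Nesterov's smoothing technique and the Legendre-Fenchel transformation to reformulate the Bellman equation as a new primal-dual optimization problem, and develops a new algorithm, called Smoothed Bellman Error Embedding, to solve it, where any differentiable function class may be used. We provide the first convergence guarantee for general nonlinear function approximation and analyze the algorithm's sample complexity. Empirically, our algorithm compares favorably to state-of-the-art baselines on several benchmark control problems.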