Considering non-stationary environments in online optimization enables a
decision-maker to adapt effectively to changes and improve its performance over
time. In such cases, it is favorable to adopt a strategy that minimizes the
negative impact of change to avoid potentially risky situations. In this paper,
we investigate risk-averse online optimization where the distribution of the
random cost changes over time. We minimize a risk-averse objective function using
the Conditional Value at Risk (CVaR) as the risk measure. Due to the difficulty in
obtaining the exact CVaR gradient, we employ a zeroth-order optimization
approach that queries the cost function values multiple times at each iteration
and estimates the CVaR gradient using the sampled values. To facilitate the
regret analysis, we use a variation metric based on Wasserstein distance to
capture time-varying distributions. Given that the distribution variation is
sub-linear in the total number of episodes, we show that our designed learning
algorithm achieves sub-linear dynamic regret with high probability for both
convex and strongly convex functions. Moreover, theoretical results suggest
that increasing the number of samples leads to a reduction in the dynamic
regret bounds until the sampling number reaches a specific limit. Finally, we
provide numerical experiments on dynamic pricing in a parking lot to illustrate
the efficacy of the designed algorithm.
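
To make the sampling-based estimate concrete, the following is a minimal Python sketch of a generic one-point zeroth-order estimator applied to an empirical CVaR. It is an illustration under assumptions, not the paper's exact estimator: the names `empirical_cvar`, `zeroth_order_cvar_grad`, and `sample_cost`, the smoothing radius `delta`, and the convention that `alpha` is the fraction of worst-case (largest) costs being averaged are all hypothetical choices made here.

```python
import numpy as np


def empirical_cvar(costs, alpha):
    """Average of the largest alpha-fraction of the sampled costs.

    Assumed convention: alpha in (0, 1] is the tail fraction, so
    alpha = 1 recovers the plain sample mean.
    """
    costs = np.sort(np.asarray(costs, dtype=float))
    k = max(1, int(np.ceil(alpha * costs.size)))  # number of tail samples
    return costs[-k:].mean()


def zeroth_order_cvar_grad(sample_cost, x, alpha, delta, n_samples, rng):
    """One-point zeroth-order estimate of the CVaR gradient at x.

    sample_cost(x, rng) is a hypothetical callable returning one random
    cost for decision x; it is queried n_samples times at a perturbed point.
    """
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)  # uniform random direction on the unit sphere
    perturbed = x + delta * u
    costs = [sample_cost(perturbed, rng) for _ in range(n_samples)]
    # Standard one-point form: (d / delta) * function value * direction.
    return (d / delta) * empirical_cvar(costs, alpha) * u
```

A gradient-descent step would then update the decision as `x - step_size * zeroth_order_cvar_grad(...)`, with `step_size` another assumed parameter. Using more samples per iteration makes the empirical CVaR less noisy, which is the mechanism behind the abstract's remark on the number of samples.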

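This paper studies non-stationary environments in online optimization so that a decision-maker can adapt to changes and improve its performance. We adopt a strategy that minimizes a risk-averse objective function, using the Conditional Value at Risk (CVaR) as the risk measure and a zeroth-order optimization method to estimate the CVaR gradient. Theoretical results show that the designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions. Numerical experiments on dynamic pricing in a parking lot demonstrate the effectiveness of the designed algorithm.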

Risk-averse Learning with Non-Stationary Distributions

We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize
the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior
theoretical work on risk-sensitive RL focuses on the tabular Markov
Decision Process (MDP) setting. To extend CVaR RL to settings where the state
space is large, function approximation must be deployed. We study CVaR RL in
low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the
underlying transition kernel admits a low-rank decomposition, but unlike prior
linear models, low-rank MDPs do not assume the feature or state-action
representation is known. We propose a novel Upper Confidence Bound (UCB)
bonus-driven algorithm to carefully balance the interplay between exploration,
exploitation, and representation learning in CVaR RL. We prove that our
algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2
d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$
is the length of each episode, $A$ is the size of the action space, and $d$ is
the dimension of the representations. Computationally, we design a novel
discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR
objective as the planning oracle and show that we can find the near-optimal
policy in polynomial running time with a Maximum Likelihood Estimation
oracle. To our knowledge, this is the first provably efficient CVaR RL
algorithm in low-rank MDPs.
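
For reference, the CVaR of a random return is commonly written in the variational form below. This is the standard definition from the risk-measure literature and is assumed here; the abstract does not state which form the paper uses. The symbols $X$ (the random cumulative reward of a policy) and $b$ (an auxiliary threshold) are introduced only for illustration.

```latex
\mathrm{CVaR}_\tau(X) \;=\; \sup_{b \in \mathbb{R}}
  \left\{ b - \frac{1}{\tau}\, \mathbb{E}\!\left[ (b - X)^{+} \right] \right\},
\qquad (z)^{+} = \max(z, 0), \quad \tau \in (0, 1].
```

Under this definition, $\tau = 1$ recovers the ordinary expected return $\mathbb{E}[X]$, and for a continuous $X$ the value equals the mean of the worst $\tau$-fraction of outcomes, so smaller $\tau$ corresponds to a more risk-averse objective.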

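We study risk-sensitive reinforcement learning (RL), where the goal is to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. To extend CVaR RL to large state spaces, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume that the underlying transition kernel admits a low-rank decomposition but, unlike linear models, do not assume that the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm that carefully balances exploration, exploitation, and representation learning in CVaR RL. We prove that the algorithm attains an $\epsilon$-optimal CVaR with sample complexity $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$, where $H$ is the length of each episode, $A$ is the size of the action space, and $d$ is the dimension of the representations. Computationally, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that a near-optimal policy can be found in polynomial time given a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.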