We consider a constrained Markov Decision Problem (CMDP) where the goal of an
agent is to maximize the expected discounted sum of rewards over an infinite
horizon while ensuring that the expected discounted sum of costs exceeds a
certain threshold. Building on the idea of momentum-based acceleration, we
develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm
that guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint
violation with $\mathcal{O}(\epsilon^{-3})$ sample complexity. This improves
the state-of-the-art sample complexity in CMDP by a factor of
$\mathcal{O}(\epsilon^{-1})$.
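
The primal-dual idea named above can be illustrated with a minimal sketch of the dual half of such a scheme. This is not the PD-ANPG algorithm itself: the abstract does not give its update rules, so the function names, the step size, and the cap `lam_max` are all assumptions made for illustration. The sketch assumes the constraint is written as "cost value at least threshold", matching the abstract's wording.

```python
def lagrangian(reward_value, cost_value, threshold, lam):
    """Lagrangian of a CMDP with the constraint cost_value >= threshold.

    The policy (primal variable) tries to maximize this quantity, while the
    multiplier lam (dual variable) tries to minimize it.
    """
    return reward_value + lam * (cost_value - threshold)


def dual_step(lam, cost_value, threshold, step_size, lam_max):
    """One projected gradient-descent step on the Lagrange multiplier.

    The gradient of the Lagrangian with respect to lam is
    (cost_value - threshold), so lam grows when the constraint is violated
    and shrinks when it holds with slack. The result is clipped to
    [0, lam_max]; lam_max is a hypothetical cap chosen by the user.
    """
    lam = lam - step_size * (cost_value - threshold)
    return min(max(lam, 0.0), lam_max)
```

In a full primal-dual loop, each call to `dual_step` would alternate with a policy update that ascends `lagrangian` in the policy parameters; here `cost_value` stands in for whatever estimate of the expected discounted cost is available.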

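For constrained Markov decision problems (CMDPs), we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm, which guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\mathcal{O}(\epsilon^{-3})$ sample complexity, improving on the state-of-the-art CMDP sample complexity by a factor of $\mathcal{O}(\epsilon^{-1})$.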

Sample-Efficient Constrained Reinforcement Learning with General Parameterization

Media streaming is the dominant application over wireless edge (access)
networks. The increasing softwarization of such networks has led to efforts at
intelligent control, wherein application-specific actions may be dynamically
taken to enhance the user experience. The goal of this work is to develop and
demonstrate learning-based policies for optimal decision making to determine
which clients to dynamically prioritize in a video streaming setting. We
formulate the policy design question as a constrained Markov decision problem
(CMDP), and observe that by using a Lagrangian relaxation we can decompose it
into single-client problems. Further, the optimal policy takes a threshold form
in the video buffer length, which enables us to design an efficient constrained
reinforcement learning (CRL) algorithm to learn it. Specifically, we show that
a natural policy gradient (NPG) based algorithm that is derived using the
structure of our problem converges to the globally optimal policy. We then
develop a simulation environment for training, and a real-world intelligent
controller attached to a WiFi access point for evaluation. We empirically show
that the structured learning approach enables fast learning. Furthermore, such
a structured policy can be easily deployed due to low computational complexity,
leading to policy execution taking only about 15 $\mu$s. Using YouTube streaming
experiments in a resource-constrained scenario, we demonstrate that the CRL
approach can increase QoE by over 30%.
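
A threshold policy in the video buffer length, as described above, can be sketched in a few lines. This is an illustrative sketch only: the abstract does not state the direction of the threshold or how it is parameterized, so prioritizing a client when its buffer is *below* the threshold, the sigmoid relaxation, and the names `threshold_s` and `temperature_s` are all assumptions.

```python
import math


def threshold_policy(buffer_len_s, threshold_s):
    """Hard threshold rule: prioritize the client (True) when its video
    buffer, in seconds, is below the threshold. The direction of the rule
    is an assumption, not taken from the source."""
    return buffer_len_s < threshold_s


def soft_threshold_prob(buffer_len_s, threshold_s, temperature_s=1.0):
    """Smooth relaxation of the rule above: probability of prioritizing.

    A sigmoid in (threshold - buffer) gives a policy that is differentiable
    in threshold_s, which is one hypothetical way to make the threshold
    learnable with a gradient-based method.
    """
    z = (threshold_s - buffer_len_s) / temperature_s
    return 1.0 / (1.0 + math.exp(-z))
```

Evaluating the hard rule is a single comparison per client, which is consistent with the low execution cost the abstract reports for structured policies.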

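We use learning-based policies to decide which clients to dynamically prioritize in a video streaming setting, improving the user experience with a QoE increase of over 30%, and we use a structured policy with low computational complexity to enable fast learning.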

Structured Reinforcement Learning for Media Streaming at the Wireless Edge

Safe exploration can be regarded as a constrained Markov decision problem
where the expected long-term cost is constrained. Previous off-policy
algorithms convert the constrained optimization problem into the corresponding
unconstrained dual problem by introducing the Lagrangian relaxation technique.
However, the cost function in these algorithms gives inaccurate estimates,
which destabilizes the learning of the Lagrange multiplier. In
this paper, we present a novel off-policy reinforcement learning algorithm
called Conservative Distributional Maximum a Posteriori Policy Optimization
(CDMPO). First, to accurately judge whether the current situation satisfies
the constraints, CDMPO adopts a distributional reinforcement learning method to
estimate the Q-function and C-function. Then, CDMPO uses a conservative value
function loss to reduce the number of violations of constraints during the
exploration process. In addition, we utilize Weighted Average Proportional
Integral Derivative (WAPID) to update the Lagrange multiplier stably. Empirical
results show that the proposed method has fewer violations of constraints in
the early exploration process. The final test results also illustrate that our
method has better risk control.
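
To illustrate how a proportional-integral-derivative rule can drive a Lagrange multiplier, here is a minimal sketch of a plain PID update. It is not WAPID: the abstract does not specify the weighted-average component, so none is included, and the class name, the gains `kp`, `ki`, `kd`, and the clipping choices are assumptions.

```python
class PIDMultiplier:
    """Plain PID controller that outputs a non-negative Lagrange multiplier.

    The error signal is (cost_estimate - cost_limit): positive when the
    constraint is violated, negative when it holds with slack.
    """

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0    # accumulated violation, kept non-negative
        self.prev_error = 0.0  # error from the previous update

    def update(self, cost_estimate, cost_limit):
        error = cost_estimate - cost_limit
        # Integral term: accumulate the error, clipped at zero so that long
        # periods of slack do not build up a negative reserve.
        self.integral = max(0.0, self.integral + error)
        # Derivative term: change in error since the previous update.
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # A Lagrange multiplier for an inequality constraint is non-negative.
        return max(0.0, raw)
```

With `kp = kd = 0` this reduces to an integral-only update, which behaves like ordinary gradient ascent on the multiplier; the proportional and derivative terms are what let the multiplier react before the accumulated violation becomes large.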

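This paper proposes an off-policy reinforcement learning algorithm, Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO), for the constrained decision problem that arises in safe exploration. It uses a distributional reinforcement learning method to accurately estimate the Q-function and C-function, uses a conservative value function loss to reduce the number of constraint violations, and uses Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably; in experiments it shows better risk control.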