This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a
reinforcement learning algorithm specifically designed for weakly coupled MDP
problems with continuous action spaces. LPCA addresses the challenge of
resource constraints dependent on continuous actions by introducing a Lagrange
relaxation of the weakly coupled MDP problem within a neural network framework
for Q-value computation. This approach effectively decouples the MDP, enabling
efficient policy learning in resource-constrained environments. We present two
variations of LPCA: LPCA-DE, which utilizes differential evolution for global
optimization, and LPCA-Greedy, a method that incrementally and greedily selects
actions based on Q-value gradients. Comparative analysis against other
state-of-the-art techniques across various settings highlights LPCA's robustness
and efficiency in managing resource allocation while maximizing rewards.
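
To make the greedy variant concrete, the following is a minimal sketch of incremental, gradient-based action selection under a shared resource budget. It is an illustration under stated assumptions, not the paper's implementation: `toy_q` is a hypothetical concave stand-in for the learned per-arm Q-network, `greedy_allocate`, `step`, and `eps` are illustrative names, and the multiplier `lam` is assumed to act as a per-unit price on the resource, as a Lagrange relaxation typically would.

```python
import numpy as np


def toy_q(actions):
    """Hypothetical stand-in for learned per-arm Q-values (concave in each action)."""
    gains = np.array([1.0, 2.0, 3.0])
    return gains * actions - actions ** 2  # one value per arm, separable across arms


def greedy_allocate(q_fn, n_arms, budget, lam=0.0, step=0.01, eps=1e-4):
    """Spend `budget` in increments of `step`, each time on the arm whose
    Q-value has the largest finite-difference gradient.

    Assumption: `lam` is a Lagrange multiplier pricing one unit of resource,
    so allocation stops once no arm's marginal Q gain exceeds `lam`.
    """
    actions = np.zeros(n_arms)
    for _ in range(int(round(budget / step))):
        # Forward-difference gradient per arm; perturbing all arms at once
        # is valid here only because q_fn is separable across arms.
        grads = (q_fn(actions + eps) - q_fn(actions)) / eps
        best = int(np.argmax(grads))
        if grads[best] - lam <= 0:
            break  # no arm is worth another unit of resource
        actions[best] += step
    return actions


allocation = greedy_allocate(toy_q, n_arms=3, budget=1.0)
priced = greedy_allocate(toy_q, n_arms=3, budget=1.0, lam=2.0)
```

With `lam=0` the loop spends the whole budget and roughly equalizes marginal gains across the arms that receive resource; with a positive `lam` it may stop early and leave budget unspent. A global optimizer such as differential evolution could replace the loop when the Q-function is not concave, which is the role the abstract assigns to LPCA-DE.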
