We study the data packet transmission problem (mmDPT) in dense cell-free
millimeter wave (mmWave) networks, i.e., users sending data packet requests to
access points (APs) via uplinks and APs transmitting requested data packets to
users via downlinks. Our objective is to minimize the average delay in the
system due to APs' limited service capacity and unreliable wireless channels
between APs and users. This problem can be formulated as a restless multi-armed
bandit problem with a fairness constraint (RMAB-F). Since finding the optimal
policy for RMAB-F is intractable, existing learning algorithms are
computationally expensive and not suitable for practical dynamic dense mmWave
networks. In this paper, we propose a structured reinforcement learning (RL)
solution for mmDPT by exploiting the inherent structure encoded in RMAB-F. To
achieve this, we first design a low-complexity and provably asymptotically
optimal index policy for RMAB-F. Then, we leverage this structure information
to develop a structured RL algorithm called mmDPT-TS, which provably achieves
an \tilde{O}(\sqrt{T}) Bayesian regret. More importantly, mmDPT-TS is
computationally efficient and thus amenable to practical implementation, as it
fully exploits the structure of the index policy for making decisions. Extensive
emulations based on data collected in realistic mmWave networks demonstrate
significant gains of mmDPT-TS over existing approaches.
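
As a rough illustration of the structured-learning idea described above, the following Python sketch pairs posterior sampling with a top-B index rule. It is not mmDPT-TS itself. The Bernoulli channel model, the Beta posteriors, the index (queue length times sampled success probability), and all names (`select_arms`, `update_posterior`, `budget`) are assumptions made only for illustration.

```python
import random


def select_arms(queue_lengths, alpha, beta, budget, rng):
    """Pick which users (arms) to serve in one time slot.

    Assumed model: each arm has an unknown Bernoulli success probability
    with a Beta(alpha, beta) posterior. The index used here is a
    hypothetical stand-in, not the index policy from the paper.
    """
    # Thompson-sampling step: draw one plausible success probability per arm.
    sampled = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    # Hypothetical index: favour long queues and reliable channels.
    index = [q * p for q, p in zip(queue_lengths, sampled)]
    # Serve the `budget` arms with the largest index values.
    ranked = sorted(range(len(index)), key=lambda i: index[i], reverse=True)
    return ranked[:budget]


def update_posterior(alpha, beta, arm, success):
    """Conjugate Beta update after observing one transmission outcome."""
    if success:
        alpha[arm] += 1
    else:
        beta[arm] += 1


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, beta = [1, 1, 1, 1], [1, 1, 1, 1]
    chosen = select_arms([3, 0, 5, 1], alpha, beta, budget=2, rng=rng)
    print("served arms:", chosen)
```

The point of the sketch is that each decision costs only a posterior draw and a sort, rather than solving a full planning problem at every step.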

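By exploiting the inherent structure encoded in RMAB-F, we propose a structured reinforcement learning solution, mmDPT-TS, that minimizes the average system delay caused by the APs' limited service capacity and the unreliable wireless channels between APs and users.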

Structured Reinforcement Learning for Delay-Optimal Data Transmission in Dense mmWave Networks

Structured reinforcement learning leverages policies with advantageous
properties to reach better performance, particularly in scenarios where
exploration poses challenges. We explore this field through the concept of
orchestration, where a (small) set of expert policies guides decision-making;
the modeling thereof constitutes our first contribution. We then establish
value-function regret bounds for orchestration in the tabular setting by
transferring regret-bound results from adversarial settings. We generalize and
extend the analysis of natural policy gradient in Agarwal et al. [2021, Section
5.3] to arbitrary adversarial aggregation strategies. We also extend it to the
case of estimated advantage functions, providing insights into sample
complexity both in expectation and with high probability. A key point of our
approach lies in its arguably more transparent proofs compared to existing
methods. Finally, we present simulations for a stochastic matching toy model.
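
To make the aggregation step concrete, here is a minimal Python sketch of one exponential-weights update over a small set of expert policies at a single state. This is one possible aggregation strategy of the kind the abstract alludes to, not necessarily the one analysed in the paper. The function name `exp_weights_update`, the step size `eta`, and the example advantage values are assumptions for illustration.

```python
import math


def exp_weights_update(weights, advantages, eta):
    """One multiplicative-weights step over expert policies at a fixed state.

    weights:    current probabilities of following each expert (sums to 1)
    advantages: estimated advantage of following each expert (assumed given)
    eta:        step size (hypothetical tuning parameter)
    """
    # Reweight each expert by the exponential of its scaled advantage.
    scaled = [w * math.exp(eta * a) for w, a in zip(weights, advantages)]
    total = sum(scaled)
    # Renormalise so the result is again a probability distribution.
    return [s / total for s in scaled]


if __name__ == "__main__":
    w = exp_weights_update([0.5, 0.5], [1.0, 0.0], eta=1.0)
    print(w)  # the first expert now receives the larger weight
```

Experts with larger estimated advantage gain probability mass at each step, which mirrors the form of the natural policy gradient update in the tabular softmax setting.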

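Structured reinforcement learning improves performance by using policies with advantageous properties, especially in scenarios where exploration is challenging. We study this area through the concept of orchestration, in which a set of expert policies guides decision-making, and we provide a model for it. By transferring regret-bound results from adversarial settings, we establish value-function regret bounds for orchestration in the tabular setting. We also generalize and extend the natural policy gradient analysis of Agarwal et al. [2021, Section 5.3] to arbitrary adversarial aggregation strategies and to the case of estimated advantage functions, providing insights into sample complexity both in expectation and with high probability. A key point of our approach is that its proofs are more transparent than those of existing methods. Finally, we present simulations for a stochastic matching toy model.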