Structured reinforcement learning leverages policies with advantageous properties to achieve better performance, particularly in scenarios where exploration is challenging. We explore this field through the concept of orchestration, where a (small) set of expert policies guides decision-making; modeling this setting is our first contribution. We then establish value-function regret bounds for orchestration in the tabular setting by transferring regret-bound results from adversarial settings. We generalize and extend the analysis of natural policy gradient in Agarwal et al. [2021, Section 5.3] to arbitrary adversarial aggregation strategies. We also extend it to the case of estimated advantage functions, providing insights into sample complexity both in expectation and with high probability. A key feature of our approach is that its proofs are arguably more transparent than those of existing methods. Finally, we present simulations for a stochastic matching toy model.

Symphony of Experts: Orchestration with Adversarial Insights in Reinforcement Learning