Feb, 2024

Batched Nonparametric Contextual Bandits

Rong Jiang, Cong Ma

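TL;DR: For nonparametric contextual bandits under batch constraints, we propose the Batched Successive Elimination with Dynamic Binning (BaSEDB) algorithm, which achieves optimal regret by dynamically splitting the covariate space into smaller bins whose widths are matched to the batch size. The results highlight the suboptimality of static binning and show that a nearly constant number of policy updates is enough to attain the optimal regret of the fully online setting.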

Abstract

We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch