BriefGPT.xyz
Feb, 2024
Batched Nonparametric Contextual Bandits
Rong Jiang, Cong Ma
TL;DR
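For nonparametric contextual bandits under batch constraints, we propose the Batched Successive Elimination with Dynamic Binning (BaSEDB) algorithm, which achieves optimal regret by dynamically splitting the covariate space into smaller bins whose widths are matched to the batch sizes. The results highlight the suboptimality of static binning and show that a nearly constant number of policy updates suffices to attain the optimal regret of the fully online setting.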
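To make the binning idea concrete, here is a minimal sketch of what one end-of-batch update might look like for a one-dimensional covariate. It is not the paper's BaSEDB procedure: the `Bin` class, the `eliminate` helper, the confidence radius `sqrt(2 log T / n)`, and the split factor are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Bin:
    """Hypothetical covariate bin [lo, hi) with its surviving arms."""
    lo: float
    hi: float
    active: set

    def split(self, n_children):
        # Refine the bin into equal-width children; each child inherits
        # its own copy of the parent's active arm set.
        width = (self.hi - self.lo) / n_children
        return [
            Bin(self.lo + i * width, self.lo + (i + 1) * width, set(self.active))
            for i in range(n_children)
        ]


def eliminate(bin_, sums, counts, horizon):
    """Drop arms whose upper confidence bound is below the best lower bound.

    sums[a] and counts[a] are the reward total and pull count of arm a
    inside this bin during the batch (assumed bookkeeping, not from the paper).
    """
    if any(counts.get(a, 0) == 0 for a in bin_.active):
        return  # not enough data to compare every active arm
    mean = {a: sums[a] / counts[a] for a in bin_.active}
    # Assumed Hoeffding-style radius; the paper's thresholds may differ.
    rad = {a: math.sqrt(2 * math.log(horizon) / counts[a]) for a in bin_.active}
    best_lcb = max(mean[a] - rad[a] for a in bin_.active)
    bin_.active = {a for a in bin_.active if mean[a] + rad[a] >= best_lcb}


# Usage: refine the unit interval, then update one child after a batch.
root = Bin(0.0, 1.0, {0, 1})
children = root.split(4)
eliminate(children[0], sums={0: 900.0, 1: 100.0}, counts={0: 1000, 1: 1000}, horizon=10_000)
```

In this sketch a bin that still has more than one active arm after a batch would be split again, so the bin width shrinks as batches accumulate, in contrast to a static partition fixed up front.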
Abstract
We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch