We study nonparametric contextual bandits under batch constraints, where the
expected reward for each action is modeled as a smooth function of covariates,
and the policy updates are made at the end of each batch of observations. We
establish a minimax regret lower bound for this setting and propose Batched
Successive Elimination with Dynamic Binning (BaSEDB) that achieves optimal
regret (up to logarithmic factors). In essence, BaSEDB dynamically splits the
covariate space into smaller bins, carefully aligning their widths with the
batch size. We also show the suboptimality of static binning under batch
constraints, highlighting the necessity of dynamic binning. Additionally, our
results suggest that a nearly constant number of policy updates can attain
optimal regret in the fully online setting.
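
The interplay between batching and bin refinement can be illustrated with a toy sketch. It is not the BaSEDB procedure itself: the one-dimensional covariate on [0, 1], the function name `batched_se_dynamic_bins`, the batch end points, the split factors, and the confidence radius sqrt(2 log T / n) are all assumptions made for illustration. The sketch only shows the general pattern of eliminating arms per bin at batch boundaries and then splitting each bin so that later batches use finer bins.

```python
import math
import random


def batched_se_dynamic_bins(T, batch_ends, splits, mean_fns, noise=0.1, seed=0):
    """Toy batched successive elimination on [0, 1] with bins refined per batch.

    batch_ends: increasing round indices at which the policy may be updated.
    splits[0]:  initial number of bins; splits[i] for i >= 1 is the number of
                children each bin is split into after batch i.
    All of these choices are illustrative assumptions, not the paper's values.
    """
    rng = random.Random(seed)
    n_arms = len(mean_fns)
    n_bins = splits[0]
    active = [set(range(n_arms)) for _ in range(n_bins)]
    sums = [[0.0] * n_arms for _ in range(n_bins)]
    counts = [[0] * n_arms for _ in range(n_bins)]
    regret, batch = 0.0, 0

    for t in range(1, T + 1):
        x = rng.random()                          # covariate, uniform on [0, 1)
        b = min(int(x * n_bins), n_bins - 1)      # index of the bin containing x
        arm = rng.choice(sorted(active[b]))       # pull an active arm at random
        sums[b][arm] += mean_fns[arm](x) + rng.gauss(0.0, noise)
        counts[b][arm] += 1
        regret += max(f(x) for f in mean_fns) - mean_fns[arm](x)

        if batch < len(batch_ends) and t == batch_ends[batch]:
            # Policy update: allowed only at the end of a batch.
            for j in range(n_bins):
                lcb, ucb = {}, {}
                for a in active[j]:
                    if counts[j][a] == 0:
                        continue                  # no data in this bin: keep arm
                    mean = sums[j][a] / counts[j][a]
                    radius = math.sqrt(2.0 * math.log(T) / counts[j][a])
                    lcb[a], ucb[a] = mean - radius, mean + radius
                if lcb:
                    best_lcb = max(lcb.values())
                    # Drop arms whose upper bound is below the best lower bound.
                    active[j] -= {a for a in ucb if ucb[a] < best_lcb}
            batch += 1
            if batch < len(splits):
                # Dynamic binning: split each bin; children inherit active arms.
                k = splits[batch]
                active = [set(s) for s in active for _ in range(k)]
                n_bins *= k
            # Statistics restart for the next batch (a simplifying assumption).
            sums = [[0.0] * n_arms for _ in range(n_bins)]
            counts = [[0] * n_arms for _ in range(n_bins)]

    return regret, active, n_bins


# Two hypothetical arms whose mean rewards cross at x = 0.5.
arms = [lambda x: x, lambda x: 1.0 - x]
regret, active, n_bins = batched_se_dynamic_bins(
    T=20000, batch_ends=[2000, 8000, 20000], splits=[2, 2, 2], mean_fns=arms
)
print(n_bins, [sorted(s) for s in active], round(regret, 1))
```

In this toy run the policy changes only three times, and the bins are coarse while data are scarce and finer once later, larger batches can support them. A static grid would have to commit to a single width for all batches.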

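For nonparametric contextual bandits under batch constraints, we propose the Batched Successive Elimination with Dynamic Binning (BaSEDB) algorithm, which achieves optimal regret by dynamically splitting the covariate space into smaller bins whose widths are matched to the batch size. We highlight the suboptimality of static binning and show that a nearly constant number of policy updates suffices to attain the optimal regret of the fully online setting.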

Batched Nonparametric Contextual Bandits

We study nonparametric contextual bandits where Lipschitz mean reward
functions may change over time. We first establish the minimax dynamic regret
rate in this less-understood setting in terms of the number of changes $L$ and
the total variation $V$, both capturing all changes in distribution over context
space, and argue that state-of-the-art procedures are suboptimal in this
setting.
Next, we turn to the question of adaptivity for this setting, i.e.,
achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly,
we posit that the bandit problem, viewed locally at a given context $X_t$,
should not be affected by reward changes in other parts of the context space
$\mathcal{X}$. We therefore propose a notion of change, which we term experienced
significant shifts, that better accounts for locality, and thus counts
considerably fewer changes than $L$ and $V$. Furthermore, similar to recent work
on non-stationary MAB (Suk & Kpotufe, 2022), experienced significant shifts
only count the most significant changes in mean rewards, e.g., severe best-arm
changes relevant to observed contexts.
Our main result is to show that this more tolerant notion of change can in
fact be adapted to.

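We study nonparametric contextual bandits whose mean reward functions change over time, introduce the notion of experienced significant shifts to capture these changes, and show that this more tolerant notion of change can be adapted to while attaining the minimax dynamic regret rate.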