We study the sequential batch learning problem in linear contextual bandits
with finite action sets, where the decision maker is constrained to split
incoming individuals into (at most) a fixed number of batches and can only
observe outcomes for the individuals within a batch at the batch's end.
Compared to both standard online contextual bandit learning and offline policy
learning in contextual bandits, this sequential batch learning problem provides
a finer-grained formulation of many personalized sequential decision making
problems in practical applications, including medical treatment in clinical
trials, product recommendation in e-commerce and adaptive experiment design in
crowdsourcing.
We study two settings of the problem: one where the contexts are arbitrarily
generated and the other where the contexts are drawn \textit{iid} from some
distribution. In each setting, we establish a regret lower bound and provide an
algorithm whose regret upper bound nearly matches the lower bound. An
important insight follows: in the former setting, the
number of batches required to achieve the fully online performance is
polynomial in the time horizon, while in the latter setting, a
pure-exploitation algorithm with a judicious batch partition scheme achieves
the fully online performance even when the number of batches is less than
logarithmic in the time horizon. Together, our results provide a near-complete
characterization of sequential decision making in linear contextual bandits
when batch constraints are present.
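
To make the batch constraint concrete, the following is a minimal sketch of a pure-exploitation (greedy least-squares) learner that only updates its estimate at batch ends. It is an illustration under stated assumptions, not the paper's algorithm: the geometric grid in `batch_grid`, the Gaussian contexts, the ridge term, and all function and parameter names are hypothetical choices made for this example.

```python
import random


def batch_grid(horizon, num_batches):
    """Hypothetical geometric grid: batch m ends near horizon ** (m / num_batches)."""
    ends = []
    for m in range(1, num_batches + 1):
        end = round(horizon ** (m / num_batches))
        prev = ends[-1] if ends else 0
        ends.append(min(horizon, max(end, prev + 1)))  # keep ends strictly increasing
    ends[-1] = horizon  # the last batch always closes at the horizon
    return ends


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def run_batched_greedy(theta_star, horizon, num_batches,
                       num_actions=5, noise=0.1, seed=0):
    """Greedy play with feedback revealed only at batch ends (illustrative)."""
    rng = random.Random(seed)
    d = len(theta_star)
    # Ridge-regularised Gram matrix (identity prior is an assumption).
    gram = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    theta_hat = [0.0] * d
    pending = []  # (context, reward) pairs not yet observed by the learner
    ends = set(batch_grid(horizon, num_batches))
    regret, updates = 0.0, 0
    for t in range(1, horizon + 1):
        # Assumed stochastic setting: iid standard Gaussian context per action.
        contexts = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_actions)]
        chosen = max(contexts, key=lambda x: dot(x, theta_hat))
        best = max(dot(x, theta_star) for x in contexts)
        regret += best - dot(chosen, theta_star)
        pending.append((chosen, dot(chosen, theta_star) + rng.gauss(0, noise)))
        if t in ends:  # outcomes become visible only when the batch closes
            for x, r in pending:
                for i in range(d):
                    moment[i] += r * x[i]
                    for j in range(d):
                        gram[i][j] += x[i] * x[j]
            pending.clear()
            theta_hat = solve(gram, moment)
            updates += 1
    return regret, updates
```

Calling `run_batched_greedy([1.0, -0.5], horizon=200, num_batches=3)` returns the cumulative pseudo-regret and the number of estimate updates, which equals the number of batches. Varying `num_batches` while holding `horizon` fixed gives a rough feel for how the batch budget affects regret in this toy setup.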

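We study the sequential batch learning problem in linear contextual bandits, where the decision maker is constrained to split individuals into (at most) a fixed number of batches and can only observe outcomes for the individuals within a batch at the batch's end. We study two settings of the problem: one where the contexts are arbitrarily generated, and the other where the contexts are drawn iid from some distribution. In each setting, we establish a regret lower bound and provide an algorithm whose regret upper bound nearly matches the lower bound.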

Sequential Batch Learning in Finite-Action Linear Contextual Bandits

Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize work on the online statistical learning
paradigm known as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, relating each complication to a
specific requirement or consideration of the experiment design context.
Finally, at the end of the paper, we present a table of known upper bounds on
regret for all studied algorithms, providing both perspectives for future
theoretical work and a decision-making tool for practitioners looking for
theoretical guarantees.
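
As a concrete instance of the traditional stochastic model, the sketch below simulates Bernoulli arms and plays the well-known UCB1 index policy. The abstract does not name specific algorithms, so the choice of UCB1, the Bernoulli reward model, and the names `ucb1` and `arm_means` are assumptions made purely for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Play UCB1 on simulated Bernoulli arms; return pull counts and pseudo-regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: pull every arm once
        else:
            # Index = empirical mean + exploration bonus sqrt(2 ln t / n_i).
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]  # expected loss versus the best arm
    return counts, regret
```

For example, `ucb1([0.2, 0.8], horizon=2000)` returns the per-arm pull counts and the accumulated pseudo-regret, which can be compared against a theoretical upper bound for the chosen policy.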

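This study surveys and synthesizes work on the online statistical learning paradigm known as multi-armed bandits, as a resource for a certain class of online experiments. We first explore the traditional stochastic model of the multi-armed bandit, then explore a taxonomy of complications to that model, relating each complication to a specific requirement or consideration of the experiment design context. Finally, we provide a table of known regret upper bounds for all studied algorithms, offering both perspectives for future theoretical work and a decision-making tool for practitioners.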