We study the problem of experiment planning with function approximation in
contextual bandit problems. In settings where there is a significant overhead
to deploying adaptive algorithms -- for example, when the execution of the data
collection policies is required to be distributed, or a human in the loop is
needed to implement these policies -- producing in advance a set of policies
for data collection is paramount. We study the setting where a large dataset of
contexts but not rewards is available and may be used by the learner to design
an effective data collection strategy. Although this problem has been well
studied when rewards are linear, results are still missing for more complex
reward models. In this work we propose two experiment planning strategies
compatible with function approximation. The first is an eluder planning and
sampling procedure that can recover optimality guarantees depending on the
eluder dimension of the reward function class. For the second, we show that a
uniform sampler achieves competitive optimality rates in the setting where the
number of actions is small. Finally, we establish a statistical gap that
highlights the fundamental differences between planning and adaptive learning,
and we provide results for planning with model selection.

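We study experiment planning with function approximation in contextual
bandits. For settings where deploying data collection policies carries
significant overhead, we propose two experiment planning strategies compatible
with function approximation. The first is an eluder planning and sampling
procedure whose optimality guarantees depend on the eluder dimension of the
reward function class. For the second, we show that a uniform sampler achieves
competitive optimality rates when the number of actions is small. Finally, we
introduce a statistical gap that details the fundamental differences between
planning and adaptive learning, and we provide experiment planning results for
model selection.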
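
To make the uniform sampler concrete, the following is a minimal sketch of the plan-then-collect workflow, not the paper's implementation. It assumes, purely for illustration, a linear reward model fit separately per action by least squares, whereas the setting studied here allows a general reward function class. The names `uniform_plan` and `fit_greedy_policy` are hypothetical.

```python
import numpy as np


def uniform_plan(contexts, num_actions, rng):
    """Planning phase: uses contexts only, no rewards.

    Assigns each unlabeled context an action drawn uniformly at random,
    so the data collection plan is fixed before any reward is observed.
    """
    return rng.integers(0, num_actions, size=len(contexts))


def fit_greedy_policy(contexts, actions, rewards, num_actions):
    """Learning phase: fit a reward model on the collected data.

    Assumption for this sketch: one linear least-squares model per action.
    Returns a policy that plays the action with the highest predicted reward.
    """
    dim = contexts.shape[1]
    weights = np.zeros((num_actions, dim))
    for a in range(num_actions):
        mask = actions == a
        if mask.any():
            # lstsq returns (solution, residuals, rank, singular_values).
            weights[a] = np.linalg.lstsq(
                contexts[mask], rewards[mask], rcond=None
            )[0]

    def policy(x):
        return int(np.argmax(weights @ x))

    return policy
```

A typical use would call `uniform_plan` once on the unlabeled context dataset, deploy the resulting context-action assignments to gather rewards in a single non-adaptive batch, and then call `fit_greedy_policy` on the collected triples.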