We study batched bandit experiments and consider the problem of inference conditional on the realized stopping time, assignment probabilities, and target parameter, where all of these may be chosen adaptively using information up to the last batch of the experiment. Absent further restrictions on the experiment, we show that inference using only the results of the last batch is optimal. When the adaptive aspects of the experiment are known to be location-invariant, in the sense that they are unchanged when we shift all batch-arm means by a constant, we show that there is additional information in the data, captured by one additional linear function of the batch-arm means. In the more restrictive case where the stopping time, assignment probabilities, and target parameter are known to depend on the data only through a collection of polyhedral events, we derive computationally tractable and optimal conditional inference procedures.
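
To make the unrestricted case concrete, the following is a minimal sketch of inference that uses only the final batch. It is an illustration under stated assumptions, not the paper's procedure: outcomes are assumed Gaussian with a known common standard deviation `sigma`, the target is assumed to be a linear contrast of arm means, and the names `last_batch_ci`, `outcomes_by_arm`, and `contrast` are hypothetical. The contrast and the final-batch sample sizes may have been chosen from earlier batches; the interval conditions on them by treating them as fixed.

```python
import math
from statistics import NormalDist


def last_batch_ci(outcomes_by_arm, contrast, sigma=1.0, level=0.95):
    """Normal-theory interval for sum_k contrast[k] * mu_k.

    Uses only final-batch outcomes. Assumes (for illustration) Gaussian
    outcomes with known standard deviation `sigma`; `contrast` and the
    per-arm sample sizes are treated as fixed, i.e. conditioned on.
    """
    estimate = 0.0
    variance = 0.0
    for arm, weight in contrast.items():
        if weight == 0:
            continue  # arms outside the target do not enter the estimate
        y = outcomes_by_arm.get(arm, [])
        if not y:
            raise ValueError(f"no final-batch observations for arm {arm!r}")
        estimate += weight * sum(y) / len(y)          # weighted arm mean
        variance += weight ** 2 * sigma ** 2 / len(y)  # variance of that term
    z = NormalDist().inv_cdf(0.5 + level / 2.0)        # two-sided critical value
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical final batch: two arms, target is the difference in means.
final_batch = {"a": [1.0, 3.0], "b": [0.0, 2.0]}
target = {"a": 1.0, "b": -1.0}
point, (lower, upper) = last_batch_ci(final_batch, target)
```

The sketch deliberately ignores earlier batches, which is what the unrestricted result says cannot be improved upon; the location-invariant and polyhedral cases described above would use additional information that this code does not attempt to capture.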

Optimal Conditional Inference in Adaptive Experiments