This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The bound for LS is provably tighter than all its competitors, and naturally results in improved policy selection and learning strategies. Extensive policy evaluation, selection, and learning experiments highlight the versatility and favorable performance of LS.

该研究调查了在线情境决策问题的离线公式化，其目标是利用在行为策略下收集的过往互动来评估、选择和学习新的、潜在更好性能的策略。通过采用悲观主义的原则构建对策略最坏情况性能的上限界，我们超越了点估计器，引入了对一类广泛的重要性加权风险估计器的新颖、完全经验的集中界。这些界足够一般，覆盖了大多数现有的估计器，并为新估计器的开发铺平了道路。特别地，在类别中寻求最紧密的界的追求激发了一种新的估计器（LS），该估计器对大的重要性权重进行对数平滑。LS的界证明比所有竞争者都紧，自然而然地导致改进的策略选择和学习策略。广泛的策略评估、选择和学习实验证明了LS的多样性和有利性能。

悲观的脱机政策评估、选择和学习的对数平滑