The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) are often unable to deal with large context spaces. Essentially all existing well-performing algorithms for general contextual bandit problems rely on weighted action allocation schemes, and theoretical guarantees for optimism-based algorithms are known only for restricted formulations. In this paper, we study general contextual bandits under the realizability condition and propose a simple generic principle for designing optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). We show that these policies are provably optimal and efficient in the presence of large context spaces. Key components of UCCB include: 1) a systematic analysis of confidence bounds in policy space rather than in action space; and 2) a potential function perspective that is used to express the power of optimism in the contextual setting. We also show how the basic principles can be extended to infinite action spaces by constructing confidence bounds via the newly introduced notion of "counterfactual action divergence."

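This paper studies general contextual bandits under the realizability condition and proposes an optimistic algorithm called "Upper Counterfactual Confidence Bounds" (UCCB). The algorithm addresses large context spaces by analyzing confidence bounds in policy space rather than in action space and by using a potential function perspective to express the power of optimism in the contextual setting. The UCCB principle is extended to infinite action spaces by introducing the notion of "counterfactual action divergence."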

Upper Counterfactual Confidence Bounds: A New Optimism Principle for Contextual Bandits