Bayesian bandit algorithms with approximate Bayesian inference have been
widely used in real-world applications. Nevertheless, their theoretical
justification has received less attention in the literature, especially for contextual
bandit problems. To fill this gap, we propose a general theoretical framework
to analyze stochastic linear bandits in the presence of approximate inference
and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson
sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely
Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both
LinTS and LinBUCB preserve their original regret upper-bound rates when applied
with approximate inference, at the cost of larger constant terms. These results
hold for general Bayesian inference approaches, under
the assumption that the inference error measured by two different
$\alpha$-divergences is bounded. Additionally, by introducing a new definition
of well-behaved distributions, we show that LinBUCB improves the regret rate of
LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the
minimax optimal rate. To our knowledge, this work provides the first regret
bounds in the setting of stochastic linear bandits with bounded approximate
inference errors.
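
The difference between the two action-selection rules can be illustrated with a minimal sketch. It assumes a Gaussian approximate posterior over the parameter with a diagonal covariance, which is a simplification made here for brevity and is not taken from the paper. The function names, the `quantile` parameter, and the diagonal representation `var_diag` are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def lin_ts_choose(arms, mean, var_diag, rng):
    """Thompson sampling step: draw theta from the (approximate) posterior,
    then pick the arm with the largest sampled reward x^T theta."""
    # Assumption: independent Gaussian coordinates (diagonal covariance).
    theta = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var_diag)]
    scores = [sum(x_i * t_i for x_i, t_i in zip(x, theta)) for x in arms]
    return max(range(len(arms)), key=scores.__getitem__)


def lin_bucb_choose(arms, mean, var_diag, quantile):
    """Bayesian UCB step: score each arm by a posterior quantile of x^T theta.
    Under the Gaussian assumption, x^T theta ~ N(x^T mean, sum_i x_i^2 var_i)."""
    z = NormalDist().inv_cdf(quantile)
    scores = []
    for x in arms:
        m = sum(x_i * m_i for x_i, m_i in zip(x, mean))
        s = math.sqrt(sum(x_i * x_i * v_i for x_i, v_i in zip(x, var_diag)))
        scores.append(m + z * s)
    return max(range(len(arms)), key=scores.__getitem__)


if __name__ == "__main__":
    arms = [[1.0, 0.0], [0.0, 1.0]]
    mean, var_diag = [0.5, 0.4], [0.01, 1.0]
    print(lin_ts_choose(arms, mean, var_diag, random.Random(0)))
    print(lin_bucb_choose(arms, mean, var_diag, quantile=0.95))
```

In this sketch LinTS randomizes through a posterior draw, while LinBUCB is deterministic given the posterior and favors arms whose reward is still uncertain as the quantile level grows.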

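We propose a general theoretical framework for analyzing Bayesian bandit algorithms in stochastic linear bandits in the presence of approximate inference. We show that Linear Thompson Sampling and Linear Bayesian Upper Confidence Bound preserve their original regret upper-bound rates under approximate inference, at the cost of larger constant terms. By introducing a new definition, we show that Linear Bayesian Upper Confidence Bound improves on the regret rate of Linear Thompson Sampling, matching the minimax optimal rate. These are the first regret bounds for stochastic linear bandits with bounded approximate inference error.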

Approximate-Inference Bayesian Bandit Algorithms in Stochastic Linear Bandits

Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits

We study the problem of experiment planning with function approximation in
contextual bandit problems. In settings where there is a significant overhead
to deploying adaptive algorithms -- for example, when the execution of the data
collection policies is required to be distributed, or a human in the loop is
needed to implement these policies -- producing in advance a set of policies
for data collection is paramount. We study the setting where a large dataset of
contexts but not rewards is available and may be used by the learner to design
an effective data collection strategy. Although this problem has been well
studied when rewards are linear, results are still missing for more complex
reward models. In this work we propose two experiment planning strategies
compatible with function approximation. The first is an eluder planning and
sampling procedure that can recover optimality guarantees depending on the
eluder dimension of the reward function class. For the second, we show that a
uniform sampler achieves competitive optimality rates in the setting where the
number of actions is small. We conclude by introducing a statistical gap that
highlights the fundamental differences between planning and adaptive
learning, and we provide results for planning with model selection.
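
As a rough illustration of the second strategy, the sketch below fixes a uniform data collection policy in advance and then runs it without any adaptation. It is only a minimal sketch: `make_uniform_policy`, `collect`, and `reward_fn` are hypothetical names, and `reward_fn` stands in for the environment that would return rewards at deployment time.

```python
import random


def make_uniform_policy(num_actions, seed=0):
    """Return a static policy that ignores the context and picks an
    action uniformly at random. It is fixed before any reward is seen."""
    rng = random.Random(seed)

    def policy(context):
        return rng.randrange(num_actions)

    return policy


def collect(policy, contexts, reward_fn):
    """Run the pre-planned policy over the contexts and log
    (context, action, reward) triples; no adaptation happens here."""
    data = []
    for context in contexts:
        action = policy(context)
        data.append((context, action, reward_fn(context, action)))
    return data


if __name__ == "__main__":
    policy = make_uniform_policy(num_actions=3)
    # Hypothetical reward: 1.0 when the action matches context mod 3.
    logged = collect(policy, range(10), lambda c, a: float(a == c % 3))
    print(logged[:3])
```

Because the policy never looks at rewards, it can be distributed or handed to a human operator ahead of time, which is the deployment constraint the abstract describes.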

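We study experiment planning with function approximation in contextual bandit problems. For settings where deploying adaptive data collection carries significant overhead, we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure whose optimality guarantees depend on the eluder dimension of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates when the number of actions is small. Finally, we introduce a statistical gap that details the fundamental differences between planning and adaptive learning, and we provide results for planning with model selection.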

Experiment Planning Using Function Approximation

Experiment Planning with Function Approximation

Recent advances in learning techniques have garnered attention for their
applicability to a diverse range of real-world sequential decision-making
problems. Yet, many practical applications have critical constraints for
operation in real environments. Most learning solutions neglect the risk
of failing to meet these constraints, hindering their implementation in
real-world contexts. In this paper, we propose a risk-aware decision-making
framework for contextual bandit problems, accommodating constraints and
continuous action spaces. Our approach employs an actor multi-critic
architecture, with each critic characterizing the distribution of performance
and constraint metrics. Our framework is designed to cater to various risk
levels, effectively balancing constraint satisfaction against performance. To
demonstrate the effectiveness of our approach, we first compare it against
state-of-the-art baseline methods in a synthetic environment, highlighting the
impact of intrinsic environmental noise across different risk configurations.
Finally, we evaluate our framework in a real-world use case involving a 5G
mobile network where only our approach consistently satisfies the system
constraint (a signal processing reliability target) with a small performance
toll (8.5% increase in power consumption).

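We propose a risk-aware decision-making framework for contextual bandit problems that satisfies the constraints of real environments, using an actor multi-critic architecture to balance constraint satisfaction against performance.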

Risk-Aware Continuous Control Using Neural Contextual Bandits

Risk-Aware Continuous Control with Neural Contextual Bandits

Contextual bandit problems are a natural fit for many information retrieval
tasks, such as learning to rank, text classification, and recommendation.
However, existing learning methods for contextual bandit problems have one of
two drawbacks: they either do not explore the space of all possible document
rankings (i.e., actions) and, thus, may miss the optimal ranking, or they
present suboptimal rankings to a user and, thus, may harm the user experience.
We introduce a new learning method for contextual bandit problems, Safe
Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by
using a baseline (or production) ranking system (i.e., policy), which does not
harm the user experience and, thus, is safe to execute, but has suboptimal
performance and, thus, needs to be improved. Then SEA uses counterfactual
learning to learn a new policy based on the behavior of the baseline policy.
SEA also uses high-confidence off-policy evaluation to estimate the performance
of the newly learned policy. Once the performance of the newly learned policy
is at least as good as the performance of the baseline policy, SEA starts using
the new policy to execute new actions, allowing it to actively explore
favorable regions of the action space. This way, SEA never performs worse than
the baseline policy and, thus, does not harm the user experience, while still
exploring the action space and, thus, being able to find an optimal policy. Our
experiments using text classification and document retrieval confirm the above
by comparing SEA (and a boundless variant called BSEA) to online and offline
learning methods for contextual bandit problems.
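
The switching rule described above can be sketched as follows. The abstract does not say which high-confidence off-policy estimator SEA uses, so this example substitutes a clipped inverse propensity scoring (IPS) estimate with a Hoeffding-style lower bound. That choice, the function names, and the parameters `delta` and `w_max` are assumptions made for illustration, and rewards are assumed to lie in [0, 1].

```python
import math


def ips_lower_bound(logs, new_policy_prob, delta=0.05, w_max=10.0):
    """High-confidence lower bound on the new policy's value.

    logs: list of (context, action, reward, propensity) collected by the
    baseline policy, with reward in [0, 1] and propensity > 0.
    Importance weights are clipped at w_max, so every term lies in
    [0, w_max] and Hoeffding's inequality applies. Clipping can only
    lower the estimate for non-negative rewards, so the bound stays valid.
    """
    n = len(logs)
    if n == 0:
        return float("-inf")
    total = 0.0
    for context, action, reward, propensity in logs:
        weight = min(new_policy_prob(context, action) / propensity, w_max)
        total += weight * reward
    penalty = w_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return total / n - penalty


def choose_policy(logs, new_policy_prob, baseline_value, delta=0.05, w_max=10.0):
    """Keep the baseline until the new policy is provably at least as good."""
    bound = ips_lower_bound(logs, new_policy_prob, delta, w_max)
    return "new" if bound >= baseline_value else "baseline"


if __name__ == "__main__":
    # Baseline picks each of two actions with probability 0.5;
    # action 1 always yields reward 1, action 0 yields reward 0.
    logs = [(0, 1, 1.0, 0.5), (0, 0, 0.0, 0.5)] * 1000
    always_one = lambda context, action: 1.0 if action == 1 else 0.0
    print(choose_policy(logs, always_one, baseline_value=0.5))
```

With few logged interactions the penalty term dominates and the rule stays with the baseline, which mirrors the safety property claimed for SEA.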

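This paper proposes a new learning method for contextual bandit problems, called SEA. It does not harm the user experience while still exploring the action space, and can therefore find an optimal policy.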