We explore a stochastic contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context through a noise channel with an unknown noise parameter. Our objective is to design an action policy that can approximate" that of an oracle, which has access to the reward model, the channel parameter, and the predictive distribution of the true context from the observed noisy context. In a Bayesian framework, we introduce a Thompson sampling algorithm for Gaussian bandits with Gaussian context noise. Adopting an information-theoretic analysis, we demonstrate the Bayesian regret of our algorithm concerning the oracle's action policy. We also extend this problem to a scenario where the agent observes the true context with some delay after receiving the reward and show that delayed true contexts lead to lower Bayesian regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.

我们研究了一种随机情境线性赌博机问题，代理人通过一个未知噪声参数的噪声信道观察到真实情境的有噪声、损坏的版本。我们的目标是设计一种行动策略，可以近似一个能够获取奖励模型、信道参数以及根据观察到的有噪声情境从真实情境得到预测分布的神谕的行动策略。我们在贝叶斯框架下引入了一种基于高斯情境噪声的汤普森采样算法。采用信息论分析，对于神谕的行动策略，我们证明了该算法的贝叶斯遗憾。我们还将这个问题扩展到当代理人在接收到奖励之后，以一定延迟观察到真实情境的情景，并展示了延迟真实情境会导致更低的贝叶斯遗憾。最后，我们通过与基准算法进行实证研究，展示了所提出算法的性能。

基于信息论的噪声上下文随机赌博机的汤普森抽样算法的遗憾分析