We consider a sequential stochastic multi-armed bandit problem in which the
agent interacts with the bandit over multiple episodes. The reward distributions
of the arms remain constant throughout an episode but can change across
episodes. We propose a UCB-based algorithm that transfers reward samples
from previous episodes to improve the cumulative regret over
all episodes. We provide a regret analysis and empirical results for our
algorithm, which show significant improvement over the standard UCB algorithm
without transfer.
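
As a rough illustration of the idea of reusing reward samples, the sketch below runs standard UCB1 on Bernoulli arms and optionally warm-starts the per-arm counts and sums from earlier rewards. It is not the algorithm of the abstract: the `prior_samples` argument and the naive pooling of old and new samples are assumptions made for illustration, and pooling is only reasonable when arm means change little between episodes.

```python
import math
import random


def ucb_index(mean, pulls, t):
    """UCB1 index: empirical mean plus an exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_ucb(arm_means, horizon, prior_samples=None, seed=0):
    """Run UCB1 on Bernoulli arms and return the expected (pseudo-)regret.

    prior_samples[k] is a hypothetical list of rewards seen for arm k in
    earlier episodes; they are pooled naively with the new samples.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    prior_samples = prior_samples or [[] for _ in range(n_arms)]
    pulls = [len(s) for s in prior_samples]
    sums = [float(sum(s)) for s in prior_samples]
    best = max(arm_means)
    regret = 0.0
    for _ in range(horizon):
        untried = [k for k in range(n_arms) if pulls[k] == 0]
        if untried:
            arm = untried[0]  # sample every arm once before using the index
        else:
            t = sum(pulls)  # total sample count, transferred samples included
            arm = max(
                range(n_arms),
                key=lambda k: ucb_index(sums[k] / pulls[k], pulls[k], t),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return regret
```

Calling `run_ucb([0.2, 0.8], 500)` gives the no-transfer baseline; passing `prior_samples` lets the two variants be compared on the same seed.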

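In this study, we consider a sequential stochastic multi-armed bandit problem in which the agent interacts with the bandit over multiple episodes. The reward distributions of the arms stay fixed within an episode but may change between episodes. We propose a UCB-based method that transfers reward samples from earlier episodes to improve the cumulative regret across all episodes. We give a regret analysis and empirical results for the algorithm, which show a clear improvement over standard UCB without transfer.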

Transfer in Sequential Multi-armed Bandits via Reward Samples

The multi-armed bandit (MAB) is a classical sequential decision problem. Most
work requires assumptions about the reward distribution (e.g., bounded), while
practitioners may have difficulty obtaining information about these
distributions to design models for their problems, especially in non-stationary
MAB problems. This paper aims to design a multi-armed bandit algorithm that can
be implemented without using information about the reward distribution while
still achieving substantial regret upper bounds. To this end, we propose a
novel algorithm that alternates between a greedy rule and forced exploration. Our
method can be applied to Gaussian, Bernoulli and other subgaussian
distributions, and its implementation does not require additional information.
We employ a unified analysis method for different forced exploration strategies
and provide problem-dependent regret upper bounds for stationary and
piecewise-stationary settings. Furthermore, we compare our algorithm with
popular bandit algorithms on different reward distributions.
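
The alternation between a greedy rule and forced exploration can be sketched as follows. The exploration schedule used here (force a round-robin pull whenever fewer than sqrt(t) forced pulls have occurred by round t) and the `pull` callback are assumptions chosen for illustration, not necessarily one of the strategies analysed in the paper. Note that the rule uses only empirical means and needs no knowledge of the reward distribution.

```python
import math
import random


def forced_exploration_bandit(pull, n_arms, horizon):
    """Alternate between forced round-robin exploration and greedy play.

    `pull(arm)` returns a numeric reward. Returns the per-arm pull counts.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    forced = 0  # number of forced exploration pulls so far
    for t in range(1, horizon + 1):
        if forced < math.sqrt(t) or min(counts) == 0:
            arm = forced % n_arms  # forced exploration, round-robin order
            forced += 1
        else:
            # Greedy rule: play the arm with the highest empirical mean.
            arm = max(range(n_arms), key=lambda k: sums[k] / counts[k])
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def make_gaussian_arms(means, sd, seed=0):
    """Hypothetical test environment: Gaussian rewards with a common sd."""
    rng = random.Random(seed)
    return lambda arm: rng.gauss(means[arm], sd)
```

For example, `forced_exploration_bandit(make_gaussian_arms([0.0, 1.0], 0.1), 2, 2000)` returns how often each arm was played; replacing the Gaussian environment with Bernoulli rewards requires no change to the algorithm itself.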

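We design a multi-armed bandit algorithm that uses no information about the reward distribution and achieves substantial regret upper bounds by alternating between a greedy rule and forced exploration. We also provide an analysis method that yields problem-dependent regret upper bounds under different forced exploration strategies, covering stationary and piecewise-stationary settings with different reward distributions.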

Forced Exploration in Bandit Problems

We consider a fully cooperative multi-agent system where agents cooperate to
maximize a system's utility in a partially observable environment. We propose
that multi-agent systems must have the ability to (1) communicate and
understand the interplay between agents and (2) correctly distribute rewards
based on an individual agent's contribution. In contrast, most work in this
setting considers only one of the above abilities. In this study, we develop an
architecture that allows for communication among agents and tailors the
system's reward for each individual agent. Our architecture represents agent
communication through graph convolution and applies an existing credit
assignment structure, counterfactual multi-agent policy gradient (COMA), to
help agents learn to communicate through back-propagation. The flexibility of
the graph structure enables our method to be applicable to a variety of
multi-agent systems, e.g. dynamic systems that consist of varying numbers of
agents and static systems with a fixed number of agents. We evaluate our method
on a range of tasks, demonstrating the advantage of marrying communication with
credit assignment. In the experiments, our proposed method yields better
performance than the state-of-the-art methods, including COMA. Moreover, we show
that the communication strategies offer insight into, and interpretability of,
the system's cooperative policies.
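
The two ingredients named above can be illustrated with a small sketch. `counterfactual_advantage` computes the COMA-style advantage for one agent: the value of the chosen action minus a baseline that averages over that agent's alternative actions while the other agents' actions stay fixed. `aggregate_messages` shows only the neighbour-averaging step of a graph convolution; the learned weights and nonlinearity of a real layer are omitted, and both function names are hypothetical.

```python
def counterfactual_advantage(q_values, policy, action):
    """COMA-style advantage for a single agent.

    q_values[u] is Q(s, (u_others, u)) with the other agents' actions fixed;
    policy[u] is the agent's probability of choosing action u.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[action] - baseline


def aggregate_messages(features, adjacency):
    """One round of mean aggregation over graph neighbours (self included).

    features[i] is agent i's feature vector; adjacency[i][j] is truthy when
    agent j can send a message to agent i.
    """
    n = len(features)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adjacency[i][j] or j == i]
        dim = len(features[i])
        out.append(
            [sum(features[j][d] for j in neigh) / len(neigh) for d in range(dim)]
        )
    return out
```

Because the aggregation is defined per node over whatever neighbours exist, the same function applies whether the number of agents is fixed or varies between calls.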

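This study proposes an architecture based on graph convolution and counterfactual multi-agent policy gradient (COMA) that addresses communication and reward distribution when multiple agents cooperate to maximize a system's utility in a partially observable environment, and it achieves strong performance on a range of tasks.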

Counterfactual Multi-Agent Reinforcement Learning with Graph Convolution Communication

Multi-armed bandit algorithms have become a reference solution for handling
the explore/exploit dilemma in recommender systems, and many other important
real-world problems, such as display advertisement. However, such algorithms
usually assume a stationary reward distribution, which hardly holds in practice
as users' preferences are dynamic. This inevitably leaves a recommender system
with consistently suboptimal performance. In this paper, we consider the situation
where the underlying distribution of reward remains unchanged over (possibly
short) epochs and shifts at unknown time instants. Accordingly, we propose a
contextual bandit algorithm that detects possible changes of environment based
on its reward estimation confidence and updates its arm selection strategy
in response. A rigorous regret upper bound analysis of the proposed algorithm
demonstrates its learning effectiveness in such a non-trivial environment.
Extensive empirical evaluations on both synthetic and real-world datasets for
recommendation confirm its practical utility in a changing environment.
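
A much simplified, non-contextual sketch of confidence-based change detection is given below. It assumes rewards in [0, 1] and compares the mean of a recent window against the mean of older samples, flagging a change when the gap exceeds the sum of two Hoeffding confidence radii. The class name, the window size, and the choice of Hoeffding bounds are illustrative assumptions, not the detection rule of the proposed algorithm.

```python
import math
from collections import deque


class ChangeDetector:
    """Flag a shift in mean reward for one arm (rewards assumed in [0, 1])."""

    def __init__(self, window=30, delta=0.01):
        self.window = window
        self.delta = delta
        self.reset()

    def reset(self):
        self.old_n = 0
        self.old_sum = 0.0
        self.recent = deque()

    def _radius(self, n):
        # Hoeffding confidence radius for the mean of n samples in [0, 1].
        return math.sqrt(math.log(2.0 / self.delta) / (2.0 * n))

    def update(self, reward):
        """Add a reward; return True (and reset) if a change is flagged."""
        self.recent.append(reward)
        if len(self.recent) > self.window:
            # Move the oldest recent sample into the long-run statistics.
            self.old_sum += self.recent.popleft()
            self.old_n += 1
        if self.old_n == 0 or len(self.recent) < self.window:
            return False
        recent_mean = sum(self.recent) / len(self.recent)
        old_mean = self.old_sum / self.old_n
        gap = abs(recent_mean - old_mean)
        if gap > self._radius(len(self.recent)) + self._radius(self.old_n):
            self.reset()
            return True
        return False
```

In a bandit loop, a `True` return would be the point at which the arm's estimates are discarded and the selection strategy restarts from fresh statistics.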

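This paper presents a contextual bandit algorithm that detects changes in the environment based on its reward estimation confidence and updates its arm selection strategy accordingly; a rigorous regret upper bound analysis demonstrates its learning effectiveness in such a non-trivial environment.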