We consider the problem of repeatedly choosing policies to maximize social
welfare. Welfare is a weighted sum of private utility and public revenue.
Earlier outcomes inform later policies. Utility is not observed, but indirectly
inferred. Response functions are learned through experimentation.
We derive a lower bound on regret, and a matching adversarial upper bound for
a variant of the Exp3 algorithm. Cumulative regret grows at a rate of
$T^{2/3}$. This implies that (i) welfare maximization is harder than the
multi-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets),
and (ii) our algorithm achieves the optimal rate. For the stochastic setting,
if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for
continuous policy sets), using a dyadic search algorithm.
We analyze an extension to nonlinear income taxation, and sketch an extension
to commodity taxation. We compare our setting to monopoly pricing (which is
easier), and price setting for bilateral trade (which is harder).

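We study the problem of repeatedly choosing policies to maximize social welfare, where welfare is a weighted sum of private utility and public revenue. We derive an upper bound on regret that matches the lower bound, which shows that welfare maximization is harder than the multi-armed bandit problem and that our algorithm achieves the optimal rate.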

Adaptive maximization of social welfare

We consider the adversarial multi-armed bandit problem under delayed
feedback. We analyze variants of the Exp3 algorithm that tune their step-size
using only information (about the losses and delays) available at the time of
the decisions, and obtain regret guarantees that adapt to the observed (rather
than the worst-case) sequences of delays and/or losses. First, through a
remarkably simple proof technique, we show that with proper tuning of the step
size, the algorithm achieves an optimal (up to logarithmic factors) regret of
order $\sqrt{\log(K)(TK + D)}$ both in expectation and in high probability,
where $K$ is the number of arms, $T$ is the time horizon, and $D$ is the
cumulative delay. The high-probability version of the bound, which is the first
high-probability delay-adaptive bound in the literature, crucially depends on
the use of implicit exploration in estimating the losses. Then, following
Zimmert and Seldin [2019], we extend these results so that the algorithm can
"skip" rounds with large delays, resulting in regret bounds of order
$\sqrt{TK\log(K)} + |R| + \sqrt{D_{\bar{R}}\log(K)}$, where $R$ is an arbitrary
set of rounds (which are skipped) and $D_{\bar{R}}$ is the cumulative delay of
the feedback for other rounds. Finally, we present another, data-adaptive
(AdaGrad-style) version of the algorithm for which the regret adapts to the
observed (delayed) losses instead of only adapting to the cumulative delay
(this algorithm requires an a priori upper bound on the maximum delay, or the
advance knowledge of the delay for each decision when it is made). The
resulting bound can be orders of magnitude smaller on benign problems, and it
can be shown that the delay only affects the regret through the loss of the
best arm.
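
To make the ingredients named above concrete, the following is a minimal sketch of an Exp3-style learner with delayed feedback, a step size computed only from quantities known at decision time, and implicit exploration in the loss estimates. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function name `delayed_exp3_ix`, the specific step-size rule (based on the running count of missing observations), and the choice `gamma = eta / 2` are all assumptions made here for the example.

```python
import math
import random


def delayed_exp3_ix(losses, delays, seed=0):
    """Exp3-style learner with delayed feedback and implicit exploration.

    losses[t][i] in [0, 1] is the loss of arm i in round t.
    delays[t] >= 0 is the delay of the feedback for round t.
    Returns the list of arms played. Illustrative sketch only.
    """
    rng = random.Random(seed)
    num_rounds, num_arms = len(losses), len(losses[0])
    cum_est = [0.0] * num_arms   # sum of loss estimates received so far
    pending = {}                 # arrival round -> list of (arm, estimate)
    missing = 0                  # observations not yet received
    missing_sum = 0              # running sum of `missing` over rounds
    played = []
    for t in range(num_rounds):
        # Feedback from round s with delay d becomes usable at round s + d + 1.
        for arm, est in pending.pop(t, []):
            cum_est[arm] += est
            missing -= 1
        missing_sum += missing
        # Assumed tuning rule: uses only the round index and the observed
        # count of missing feedback, both known when the decision is made.
        eta = math.sqrt(math.log(num_arms) / (num_arms * (t + 1) + missing_sum))
        gamma = eta / 2.0        # implicit-exploration parameter (assumption)
        base = min(cum_est)      # shift for numerical stability
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        played.append(arm)
        # Implicit exploration: bias the importance weight by adding gamma
        # to the denominator, which keeps the estimate bounded.
        est = losses[t][arm] / (probs[arm] + gamma)
        pending.setdefault(t + delays[t] + 1, []).append((arm, est))
        missing += 1
    return played
```

With all delays equal to zero, `missing_sum` stays at zero and the sketch reduces to a standard anytime Exp3 with implicit exploration; larger delays shrink the step size through `missing_sum`, which grows with the delay accumulated so far.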

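This paper considers the adversarial multi-armed bandit problem under delayed feedback and analyzes variants of the Exp3 algorithm that tune their step size using only information (about the losses and delays) available at the time of the decisions, obtaining regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. We also present an AdaGrad-style version of the algorithm that adapts to the observed (delayed) losses instead of only adapting to the cumulative delay (this algorithm requires an a priori upper bound).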

Adapting to Delays and Data in Adversarial Multi-Armed Bandits

We consider the partial observability model for multi-armed bandits,
introduced by Mannor and Shamir. Our main result is a characterization of
regret in the directed observability model in terms of the dominating and
independence numbers of the observability graph. We also show that in the
undirected case, the learner can achieve optimal regret without even accessing
the observability graph before selecting an action. Both results are shown
using variants of the Exp3 algorithm operating on the observability graph in a
time-efficient manner.
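
As a rough illustration of the undirected case, the sketch below runs an Exp3-style update in which the learner picks an arm first and only then consults the observability graph to see which losses were revealed. It is a simplified example under assumptions, not the authors' exact algorithm: the function name `exp3_side_observations`, the fixed step size `eta`, and the representation of each round's graph as a list of neighbor sets (assumed symmetric) are all hypothetical choices made here.

```python
import math
import random


def exp3_side_observations(losses, graphs, eta, seed=0):
    """Exp3-style learner with side observations on undirected graphs.

    losses[t][i] in [0, 1] is the loss of arm i in round t.
    graphs[t][i] is the set of neighbors of arm i in round t; the relation
    is assumed symmetric. Playing an arm reveals its own loss and the
    losses of its neighbors. Returns the list of arms played.
    """
    rng = random.Random(seed)
    num_arms = len(losses[0])
    cum_est = [0.0] * num_arms   # cumulative importance-weighted loss estimates
    played = []
    for loss_t, graph_t in zip(losses, graphs):
        base = min(cum_est)      # shift for numerical stability
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        played.append(arm)
        # The graph is consulted only after the arm has been chosen.
        observed = {arm} | set(graph_t[arm])
        for i in observed:
            # Probability that arm i is observed: it is played itself,
            # or one of its neighbors is played.
            q_i = probs[i] + sum(probs[j] for j in graph_t[i] if j != i)
            cum_est[i] += loss_t[i] / q_i
    return played
```

With empty neighbor sets the estimator reduces to the usual bandit importance weighting, and with a complete graph every loss is observed with probability one, so the update coincides with full-information exponential weights.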

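This work considers the partial observability model introduced by Mannor and Shamir. Using variants of the Exp3 algorithm that operate efficiently on the observability graph, it characterizes regret in the directed observability model in terms of the dominating and independence numbers, and proves that in the undirected case the learner can achieve optimal regret without even accessing the observability graph before selecting an action.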