We consider an online decision-making problem with a reward function defined
over graph-structured data. We formally formulate the problem as an instance of
a graph action bandit. We then propose \texttt{GNN-TS}, a Graph Neural Network
(GNN) powered Thompson Sampling (TS) algorithm which employs a GNN approximator
for estimating the mean reward function and the graph neural tangent features
for uncertainty estimation. We prove that, under certain boundedness assumptions
on the reward function, GNN-TS achieves a state-of-the-art regret bound which
is (1) sub-linear of order $\tilde{\mathcal{O}}((\tilde{d} T)^{1/2})$ in the
number of interaction rounds, $T$, and a notion of effective dimension
$\tilde{d}$, and (2) independent of the number of graph nodes. Empirical
results validate that our proposed \texttt{GNN-TS} exhibits competitive
performance and scales well on graph action bandit problems.
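
As a rough illustration of the sampling step described in this abstract, the
following Python sketch draws one reward sample per candidate graph from a
Gaussian whose mean is a model prediction and whose variance is a quadratic
form in a feature vector. It is not the authors' implementation: in
\texttt{GNN-TS} the means would come from the GNN and the features from its
graph neural tangent features, whereas here both are plain inputs. The names
`ts_select`, `U_inv`, and `nu`, and the exact variance scaling, are assumptions
made for the example.

```python
import random


def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def quad_form(U_inv, g):
    """Return g^T U_inv g, the uncertainty score of feature vector g."""
    return sum(g_i * u_i for g_i, u_i in zip(g, matvec(U_inv, g)))


def sherman_morrison_update(U_inv, g):
    """Return (U + g g^T)^{-1} given a symmetric U^{-1}, without re-inverting."""
    Ug = matvec(U_inv, g)
    denom = 1.0 + sum(g_i * u_i for g_i, u_i in zip(g, Ug))
    d = len(g)
    return [[U_inv[i][j] - Ug[i] * Ug[j] / denom for j in range(d)]
            for i in range(d)]


def ts_select(means, feats, U_inv, nu, rng):
    """Sample a reward for each candidate and return the index of the largest.

    means: predicted mean reward per candidate (assumed to come from a model).
    feats: one feature vector per candidate (stand-in for tangent features).
    nu:    exploration scale (hypothetical hyperparameter).
    """
    samples = []
    for mu, g in zip(means, feats):
        sigma = nu * max(quad_form(U_inv, g), 0.0) ** 0.5
        samples.append(rng.gauss(mu, sigma))
    return max(range(len(samples)), key=samples.__getitem__)
```

After the chosen action with feature vector `g` is played, calling
`sherman_morrison_update(U_inv, g)` keeps the inverse design matrix current, so
the uncertainty shrinks along directions that have already been explored.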

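We propose a method for online decision-making problems based on graph neural
networks and Thompson sampling. The method uses a GNN approximator to estimate
the mean reward function and to quantify uncertainty. We prove that, under
certain boundedness assumptions on the reward function, it achieves a regret
bound that is sub-linear in the number of interaction rounds and the effective
dimension and is independent of the number of graph nodes. Empirical results
confirm that the proposed method performs competitively and scales well on
graph action bandit problems.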

Graph Neural Thompson Sampling

The bandits with knapsack (BwK) framework models online decision-making
problems in which an agent makes a sequence of decisions subject to resource
consumption constraints. The traditional model assumes that each action
consumes a non-negative amount of resources and the process ends when the
initial budgets are fully depleted. We study a natural generalization of the
BwK framework which allows non-monotonic resource utilization, i.e., resources
can be replenished by a positive amount. We propose a best-of-both-worlds
primal-dual template that can handle any online learning problem with
replenishment for which a suitable primal regret minimizer exists. In
particular, we provide the first positive results for the case of adversarial
inputs by showing that our framework guarantees a constant competitive ratio
$\alpha$ when $B=\Omega(T)$ or when the possible per-round replenishment is a
positive constant. Moreover, under a stochastic input model, our algorithm
yields an instance-independent $\tilde{O}(T^{1/2})$ regret bound which
complements existing instance-dependent bounds for the same setting. Finally,
we provide applications of our framework to some economic problems of practical
relevance.
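
To make the primal-dual idea concrete, the Python sketch below shows one round
of a generic Lagrangian scheme: the primal side picks the action with the best
reward after subtracting a priced resource cost, and the dual side adjusts the
price by a projected gradient step. It is only an illustration under strong
assumptions, not the template from the paper: the primal regret minimizer is
replaced by a greedy choice over rewards and costs that are assumed to be
visible, and `rho` (target per-round consumption), `eta` (step size), and
`lam_max` (cap on the dual variable) are hypothetical parameters. A negative
cost stands for replenishment.

```python
def primal_choice(rewards, costs, lam):
    """Return the index of the action maximizing the payoff r - lam * c."""
    scores = [r - lam * c for r, c in zip(rewards, costs)]
    return max(range(len(scores)), key=scores.__getitem__)


def dual_step(lam, cost, rho, eta, lam_max):
    """Projected gradient step on the dual variable.

    The price rises when the observed cost exceeds the target rate rho and
    falls when it is below rho, including when cost < 0 (replenishment).
    The result is clipped to the interval [0, lam_max].
    """
    return min(max(lam + eta * (cost - rho), 0.0), lam_max)
```

With a price of zero the greedy choice ignores resources entirely; as the price
grows, actions that replenish the resource become more attractive.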

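This work proposes a generalization of the BwK framework that allows
non-monotonic resource utilization, together with a flexible primal-dual
template that handles any online learning problem with replenishment under both
adversarial and stochastic inputs. The framework also applies to several
economic problems of practical relevance.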

Bandits with Replenishable Knapsacks: the Best of both Worlds

Online decision-making problems require us to make a sequence of decisions
based on incremental information. Common solutions often need to learn a reward
model of different actions given the contextual information and then maximize
the long-term reward. It is meaningful to know if the posited model is
reasonable and how the model performs in the asymptotic sense. We study this
problem under the setup of the contextual bandit framework with a linear reward
model. The $\varepsilon$-greedy policy is adopted to address the classic
exploration-and-exploitation dilemma. Using the martingale central limit
theorem, we show that the online ordinary least squares estimator of model
parameters is asymptotically normal. When the linear model is misspecified, we
propose the online weighted least squares estimator using the inverse
propensity score weighting and also establish its asymptotic normality. Based
on the properties of the parameter estimators, we further show that the
in-sample inverse propensity weighted value estimator is asymptotically normal.
We illustrate our results using simulations and an application to a news
article recommendation dataset from Yahoo!.
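
The two ingredients named in this abstract, the $\varepsilon$-greedy policy and
inverse propensity weighting, can be sketched in a few lines of Python. This is
a minimal illustration rather than the estimators analyzed in the paper: the
estimated rewards are taken as given instead of coming from an online least
squares fit, and the function names, the log format, and the `target_policy`
callable are assumptions made for the example.

```python
import random


def eps_greedy(est_rewards, eps, rng):
    """Choose an action by epsilon-greedy and return it with all propensities.

    With probability eps the action is uniform over all k actions; otherwise
    it is the action with the highest estimated reward.
    """
    k = len(est_rewards)
    greedy = max(range(k), key=est_rewards.__getitem__)
    probs = [eps / k + (1.0 - eps) * (a == greedy) for a in range(k)]
    action = rng.choices(range(k), weights=probs)[0]
    return action, probs


def ipw_value(logs, target_policy):
    """Inverse propensity weighted estimate of a target policy's value.

    logs: list of (context, action, reward, propensity of the logged action).
    target_policy: callable mapping a context to an action (hypothetical).
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        total += reward * (target_policy(context) == action) / propensity
    return total / len(logs)
```

Because $\varepsilon > 0$ gives every action a propensity of at least
$\varepsilon / k$, the weights in `ipw_value` stay bounded, which is what makes
this kind of reweighting usable on data collected by the policy itself.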

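This paper studies online decision-making in the contextual bandit framework,
building a reward model in order to maximize long-term reward. It estimates the
model parameters with OLS and WLS and uses a central limit theorem to prove that
the parameter estimators are asymptotically normal. We also validate our
conclusions through experiments.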