In this paper, we investigate the contextual multinomial logit (MNL) bandit
problem in which a learning agent sequentially selects an assortment based on
contextual information, and user feedback follows an MNL choice model. Existing
lower and upper regret bounds differ significantly, particularly in their
dependence on the feature dimension $d$ and the maximum assortment size $K$.
Additionally, these bounds are stated under different reward structures, which
complicates the quest for optimality. Under uniform rewards, where all items
have the same expected reward, we establish a regret lower bound of
$\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm,
OFU-MNL+, that achieves a matching upper bound of
$\tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we
prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of
$\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical
studies support these theoretical findings. To the best of our knowledge, this
is the first work in the MNL contextual bandit literature to prove minimax
optimality -- for either the uniform or the non-uniform reward setting -- and to
propose a computationally efficient algorithm that achieves this optimality up
to logarithmic factors.
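
As a rough illustration of the feedback model described above, the sketch below computes choice probabilities for a single assortment under a standard MNL form with an outside (no-choice) option of utility zero. The function names, the linear utility $x_i^\top w$, and the outside option are assumptions made for illustration; the abstract does not spell them out, and this is not OFU-MNL+ itself.

```python
import numpy as np

def mnl_choice_probs(features, w):
    """Choice probabilities for one assortment under an assumed MNL model.

    features: (K, d) array, one row per item in the assortment (hypothetical layout).
    w: (d,) parameter vector.
    Returns (p_items, p_outside), which together sum to 1.
    """
    utilities = features @ w
    # Shift by the largest utility (the outside option has utility 0)
    # so the exponentials cannot overflow.
    m = max(0.0, float(utilities.max()))
    exp_u = np.exp(utilities - m)
    outside = np.exp(-m)
    denom = outside + exp_u.sum()
    return exp_u / denom, outside / denom

def expected_reward(features, w, rewards):
    """Expected reward of the assortment; the outside option pays nothing."""
    p_items, _ = mnl_choice_probs(features, w)
    return float(p_items @ rewards)
```

Under this assumed form, setting every item's reward to 1 (the uniform case) makes the expected reward equal to one minus the outside-option probability.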

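This paper studies the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. Existing lower and upper regret bounds differ significantly in their dependence on the feature dimension $d$ and the maximum assortment size $K$, and the different reward structures used across these bounds complicate the quest for optimality. Under uniform rewards, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{T/K})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, which OFU-MNL+ also achieves. Our empirical studies support these theoretical results. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality and to propose a computationally efficient algorithm that achieves it up to logarithmic factors.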

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

There is a rising interest in industrial online applications where data
becomes available sequentially. Inspired by playlist recommendation, where
users' preferences can be collected while they listen to the entire
playlist, we study a novel bandit setting, namely Multi-Armed Bandit
with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward
associated with the pull of an arm is partitioned over a finite number of
consecutive rounds following the pull. This setting, unexplored so far to the
best of our knowledge, is a natural extension of delayed-feedback bandits to
the case in which rewards may be spread over a finite time span after the pull
instead of being fully disclosed in a single, potentially delayed round. We
provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and
TP-UCB-EW, which exploit the partial information disclosed by the reward
collected over time. We show that our algorithms provide better asymptotic
regret upper bounds than delayed-feedback bandit algorithms when a property
characterizing a broad set of reward structures of practical interest, namely
$\alpha$-smoothness, holds. We also empirically evaluate their performance across
a wide range of settings, both synthetically generated and from a real-world
media recommendation problem.
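
To make the feedback structure concrete, the sketch below simulates how the reward of one pull could be split over consecutive rounds and how much of it is visible at a given time. It only illustrates the setting; it is not TP-UCB-FR or TP-UCB-EW. The function names, the Dirichlet split, and the convention that the first part is revealed in the pull round itself are all assumptions made for illustration.

```python
import numpy as np

def partition_reward(total, tau_max, rng):
    """Split a total reward into tau_max non-negative per-round parts.

    The Dirichlet split is an arbitrary illustrative choice.
    """
    weights = rng.dirichlet(np.ones(tau_max))
    return total * weights

def observed_so_far(parts, pull_round, current_round):
    """Partial reward revealed by current_round for a pull made at pull_round.

    Assumed convention: part k (0-indexed) is revealed at round pull_round + k.
    """
    n_revealed = current_round - pull_round + 1
    n_revealed = max(0, min(len(parts), n_revealed))
    return float(parts[:n_revealed].sum())
```

A delayed-feedback learner would see nothing until the last part arrives, whereas the partial sums returned by `observed_so_far` are the kind of intermediate information the abstract says the proposed algorithms exploit.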

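This paper studies a novel bandit setting, the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), motivated by industrial online applications in which data becomes available sequentially. It provides two algorithms for TP-MAB problems and shows that, for a broad class of reward structures of practical interest, they achieve better asymptotic regret upper bounds than delayed-feedback bandit algorithms.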

Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

The generalized linear bandit framework has attracted a lot of attention in
recent years by extending the well-understood linear setting and allowing
richer reward structures to be modeled. It notably covers the logistic model, widely
used when rewards are binary. For logistic bandits, the frequentist regret
guarantees of existing algorithms are $\tilde{\mathcal{O}}(\kappa \sqrt{T})$,
where $\kappa$ is a problem-dependent constant. Unfortunately, $\kappa$ can be
arbitrarily large as it scales exponentially with the size of the decision set.
This may lead to significantly loose regret bounds and poor empirical
performance. In this work, we study the logistic bandit with a focus on the
prohibitive dependencies introduced by $\kappa$. We propose a new optimistic
algorithm based on a finer examination of the non-linearities of the reward
function. We show that it enjoys an $\tilde{\mathcal{O}}(\sqrt{T})$ regret with
no dependence on $\kappa$ except in a second-order term. Our analysis is based
on a new tail inequality for self-normalized martingales, which is of
independent interest.
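
The abstract does not define $\kappa$. A minimal sketch, assuming $\kappa$ is the worst-case inverse slope of the logistic function over the admissible values of $x^\top\theta$, shows why it can blow up: the inverse slope $1/\sigma'(z)$ equals $2 + e^{z} + e^{-z}$, which grows exponentially in $|z|$. The helper names and the bound `B` on $|x^\top\theta|$ are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def inverse_slope(z):
    """1 / sigma'(z), using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return 1.0 / (s * (1.0 - s))

def kappa_for_bound(B):
    """Worst-case inverse slope when |x^T theta| <= B (assumed definition).

    The inverse slope is symmetric and increasing in |z|, so the
    worst case is attained at z = B.
    """
    return inverse_slope(B)
```

Under this assumption, `kappa_for_bound(B)` is at least $e^{B}$, so even a moderate bound on $|x^\top\theta|$ makes a $\kappa\sqrt{T}$ guarantee very loose.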

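This work proposes a new algorithm for logistic bandits that avoids a problem in earlier algorithms that leads to poor empirical performance. It obtains a tighter regret bound whose leading term does not depend on $\kappa$, a constant that grows with the size of the decision set.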