In this paper, we investigate the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. There has been a significant discrepancy between the known lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{T/K})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, in either the uniform or the non-uniform reward setting, and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
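To make the feedback model concrete, the following is a minimal sketch of how choice probabilities could be computed for one assortment under a standard contextual MNL model with an outside (no-purchase) option of utility zero. The linear utility form, the function names `mnl_choice_probs` and `bound_rates`, and all parameter names are illustrative assumptions and are not taken from the paper; the second helper only evaluates the leading terms $d\sqrt{T/K}$ and $d\sqrt{T}$ of the stated bounds, ignoring constants and logarithmic factors.

```python
import math


def mnl_choice_probs(features, theta):
    """Choice probabilities for one assortment under an assumed MNL model.

    features: list of per-item feature vectors (the offered assortment).
    theta:    parameter vector of the same dimension (hypothetical name).
    Returns (p_outside, [p_item_1, ..., p_item_K]).
    """
    # Assumed linear utility: u_i = <x_i, theta>; the outside option has utility 0.
    utilities = [sum(x * w for x, w in zip(feat, theta)) for feat in features]
    # Subtract a common shift before exponentiating for numerical stability.
    shift = max(0.0, max(utilities, default=0.0))
    weights = [math.exp(u - shift) for u in utilities]
    outside = math.exp(-shift)
    total = outside + sum(weights)
    return outside / total, [w / total for w in weights]


def bound_rates(d, T, K):
    """Leading terms of the two regret rates, without constants or log factors."""
    uniform = d * math.sqrt(T / K)   # uniform-reward rate: d * sqrt(T / K)
    non_uniform = d * math.sqrt(T)   # non-uniform-reward rate: d * sqrt(T)
    return uniform, non_uniform


if __name__ == "__main__":
    p_outside, p_items = mnl_choice_probs(
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, -0.2]
    )
    print("outside option:", p_outside)
    print("items:", p_items)
    print("rates (uniform, non-uniform):", bound_rates(d=5, T=10_000, K=4))
```

Under these assumptions, the two rates differ by a factor of $\sqrt{K}$, which is the dependence on the assortment size that separates the uniform and non-uniform reward settings in the bounds above.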

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit