We study the linear contextual bandit problem with finite action sets. When
the problem dimension is $d$, the time horizon is $T$, and there are $n \leq
2^{d/2}$ candidate actions per time period, we (1) show that the minimax
expected regret is $\Omega(\sqrt{dT (\log T) (\log n)})$,
and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose
regret matches the lower bound up to iterated logarithmic factors. Our
algorithmic result saves two $\sqrt{\log T}$ factors over previous analyses,
and our information-theoretic lower bound improves on previous results by
one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from
that of classical multi-armed bandits, whose minimax regret contains no
$\log T$ term. Our proof techniques include variable confidence levels and a
careful analysis of layer sizes of SupLinUCB on the upper bound side, and
delicately constructed adversarial sequences showing the tightness of
elliptical potential lemmas on the lower bound side.
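
For reference, one standard form of the elliptical potential lemma mentioned
above reads as follows. The notation ($x_t$, $V_t$, $\lambda$) and the
normalization $\|x_t\|_2 \leq 1$ are assumptions made here for illustration and
are not taken from this paper. For any sequence $x_1, \dots, x_T \in
\mathbb{R}^d$ with $\|x_t\|_2 \leq 1$, let $V_0 = \lambda I$ with $\lambda > 0$
and $V_t = V_{t-1} + x_t x_t^\top$. Then

$$
\sum_{t=1}^{T} \min\left\{1, \|x_t\|_{V_{t-1}^{-1}}^2\right\}
\leq 2 \log \frac{\det V_T}{\det V_0}
\leq 2 d \log\left(1 + \frac{T}{d\lambda}\right).
$$

By the Cauchy-Schwarz inequality this gives $\sum_{t=1}^{T} \min\{1,
\|x_t\|_{V_{t-1}^{-1}}\} \leq \sqrt{2 d T \log(1 + T/(d\lambda))}$, which is one
place where a $\sqrt{\log T}$ factor typically enters regret analyses of this
kind.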

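We study the linear contextual bandit problem with finite action sets, introduce an algorithm called VCL SupLinUCB, and show that its regret matches the best lower bound, saving two logarithmic factors relative to previous analyses.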