The expected regret of any reinforcement learning algorithm is lower bounded
by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the
diameter of the Markov decision process, $X$ the size of the state space, $A$
the size of the action space and $T$ the number of time steps. However, this
lower bound is general. A smaller regret can be obtained by taking into account
some specific knowledge of the problem structure. In this article, we consider
an admission control problem to an $M/M/c/S$ queue with $m$ job classes and
class-dependent rewards and holding costs. Queuing systems often have a
diameter that is exponential in the buffer size $S$, making the previous lower
bound prohibitive for any practical use. We propose an algorithm inspired by
UCRL2, and use the structure of the problem to upper bound the expected total
regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the
infinite server case, we prove that the dependence of the regret on $S$
disappears.

任何强化学习算法的期望遗憾在无折扣回报情况下下界为 $\Omega\left (\sqrt {DXAT}\right)$，其中 $D$ 表示马尔科夫决策过程的直径，$X$ 表示状态空间的大小，$A$ 表示动作空间的大小，$T$ 表示时间步数。然而，这个下界是一般性的，考虑到问题结构的一些具体知识可以获得更小的遗憾。在本文中，我们考虑了一个具有 $m$ 个作业类和类依赖奖励和持有成本的 $M/M/c/S$ 队列的入场控制问题。排队系统的直径通常是缓冲区大小 $S$ 的指数级，这使得先前的下界在实际使用中变得困难。我们提出了一种受 UCRL2 启发的算法，并利用问题结构来上界有限服务器情况下的期望总遗憾为 $O (S\log T + \sqrt {mT \log T})$。在无限服务器情况下，我们证明了遗憾对 $S$ 的依赖性消失。