We study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments. We introduce a new parameter $\Lambda$, which measures the total statistical variance of the loss distributions over the $T$ rounds of the process, and study how this quantity affects the regret. We investigate the interaction between $\Lambda$ and $\Gamma$, which counts the number of times the distributions change, as well as between $\Lambda$ and $V$, which measures how far the distributions deviate over time. One striking result is that even when $\Gamma$, $V$, and $\Lambda$ are all restricted to constants, the regret lower bound in the bandit setting still grows with $T$. The other highlight is that in the full-information setting, constant regret becomes achievable with constant $\Gamma$ and $\Lambda$, since the regret can then be made independent of $T$, whereas with constant $V$ and $\Lambda$ the regret still has a $T^{1/3}$ dependence. We not only propose algorithms with upper-bound guarantees but also prove matching lower bounds.
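To make the three parameters and the notion of dynamic regret concrete, here is a minimal Python sketch that computes them from a known sequence of per-round mean loss vectors $\mu_t$ and per-arm loss variances. The precise definitions used in the paper may differ (for example in the choice of norm over arms, or in how $\Gamma$ counts the initial segment); the forms below, including taking a maximum over arms, are our assumptions, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]  # one entry per arm/expert


def count_changes(means: List[Vector]) -> int:
    """Gamma (assumed form): number of rounds t >= 2 whose mean loss
    vector differs from that of round t - 1."""
    return sum(1 for prev, cur in zip(means, means[1:]) if list(prev) != list(cur))


def total_variation(means: List[Vector]) -> float:
    """V (assumed form): sum over t of max_i |mu_{t+1}(i) - mu_t(i)|."""
    return sum(
        max(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(means, means[1:])
    )


def total_variance(variances: List[Vector]) -> float:
    """Lambda (assumed form): sum over t of the largest per-arm variance."""
    return sum(max(v) for v in variances)


def dynamic_pseudo_regret(means: List[Vector], actions: List[int]) -> float:
    """Expected loss of the chosen arms minus that of the best arm
    in each individual round (the comparator may switch every round)."""
    return sum(mu[a] - min(mu) for mu, a in zip(means, actions))


if __name__ == "__main__":
    # Hypothetical instance: T = 4 rounds, 2 arms, one distribution change.
    means = [[0.2, 0.5], [0.2, 0.5], [0.6, 0.3], [0.6, 0.3]]
    variances = [[0.01, 0.04], [0.01, 0.04], [0.0, 0.0], [0.02, 0.01]]
    actions = [0, 0, 0, 1]  # the learner reacts to the change one round late
    print(count_changes(means), total_variation(means),
          total_variance(variances), dynamic_pseudo_regret(means, actions))
```

In this toy instance the means change once (between rounds 2 and 3), so $\Gamma = 1$ and $V \approx 0.4$, while $\Lambda \approx 0.1$; reacting one round late to the change costs a dynamic regret of about $0.3$.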

Tracking the Best Expert in Non-stationary Stochastic Environments