We study the problem of learning Markov decision processes with finite state
and action spaces when the transition probability distributions and loss
functions are chosen adversarially and are allowed to change with time. We
introduce an algorithm whose regret with respect to any policy in a comparison
class grows as the square root of the number of rounds of the game, provided
the transition probabilities satisfy a uniform mixing condition. Our approach
is efficient as long as the comparison class has polynomial size and we can compute
expectations over sample paths for each policy. Designing an efficient
algorithm with small regret for the general case remains an open problem.
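
For concreteness, the regret guarantee described above can be written out as follows. The abstract fixes no notation, so every symbol here is an assumption introduced for illustration: $T$ is the number of rounds, $\ell_t$ is the loss function chosen in round $t$, $(x_t, a_t)$ is the learner's state and action in round $t$, $\Pi$ is the comparison class (policies are taken to be deterministic maps from states to actions, also an assumption), and $x_t^{\pi}$ is the state reached in round $t$ when policy $\pi$ is followed from the start under the same adversarial sequence of transition distributions.

\[
R_T(\pi) \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(x_t, a_t)\Big] \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t\big(x_t^{\pi}, \pi(x_t^{\pi})\big)\Big],
\qquad
R_T(\pi) \;=\; O\big(\sqrt{T}\big) \quad \text{for every } \pi \in \Pi .
\]

The constants hidden in the $O(\cdot)$, including any dependence on the size of $\Pi$ and on the mixing condition, are not stated in the abstract and are left unspecified here.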
