Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDP achieve a regret of $\tilde{O}(K^{6/7})$ ($K$ denotes the number of episodes), which admits a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDP into linear optimization by subtly setting the feature maps of the bandit arms of linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{O}(K^{4/5})$ for linear adversarial MDP without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.

本文探讨了如何用线性优化的方法解决在对抗环境下的马尔科夫决策过程问题，通过将特征映射设置到线性优化的赌臂中，得到了不需要访问转移模拟器的新技术，并在探索性的假设下，将线性对手马尔科夫决策问题的最优结果从 $	ilde{O}(K^{6/7})$ 提高到了 $	ilde{O}(K^{4/5})$。

通过线性优化改进线性对抗MDPs的遗憾界