Recent studies have shown that episodic reinforcement learning (RL) is no harder than bandits when the total reward is bounded by $1$, and have proved regret bounds that depend only polylogarithmically on the planning horizon $H$. However, it remains an open question whether such results can be carried over to adversarial RL, where the reward is adversarially chosen at each episode. In this paper, we answer this question affirmatively by proposing the first horizon-free policy search algorithm. To tackle the challenges caused by exploration and adversarially chosen rewards, our algorithm employs (1) a variance-uncertainty-aware weighted least squares estimator for the transition kernel; and (2) an occupancy-measure-based technique for the online search of a \emph{stochastic} policy. We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces. We also provide hardness results and regret lower bounds to justify the near optimality of our algorithm and the unavoidability of $\log|\mathcal{S}|$ and $\log|\mathcal{A}|$ in the regret bound.
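
As a rough illustration of the weighted least squares idea in item (1), the sketch below fits a parameter vector by weighted ridge regression, down-weighting samples with a larger noise scale. It is not the paper's estimator: the names `weighted_ridge`, `solve_linear_system`, `sigmas`, and `lam` are hypothetical, and the per-sample scales are assumed to be supplied by the caller, whereas the abstract indicates they would be derived from variance and uncertainty estimates.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_ridge(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    sigmas: Sequence[float],
    lam: float = 1.0,
) -> List[float]:
    """Minimise lam * ||theta||^2 + sum_k (phi_k . theta - y_k)^2 / sigma_k^2.

    `sigmas` are hypothetical per-sample noise scales: a larger sigma_k
    means sample k contributes less to the fit.
    """
    d = len(features[0])
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    for phi, y, sigma in zip(features, targets, sigmas):
        inv_var = 1.0 / (sigma * sigma)
        for i in range(d):
            moment[i] += inv_var * phi[i] * y
            for j in range(d):
                gram[i][j] += inv_var * phi[i] * phi[j]
    return solve_linear_system(gram, moment)


# Noiseless toy data generated from theta = [0.5, -0.25].
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_targets = [0.5, -0.25, 0.25]
toy_sigmas = [1.0, 2.0, 0.5]
theta_hat = weighted_ridge(toy_features, toy_targets, toy_sigmas, lam=1e-9)
```

With noiseless targets and a very small `lam`, `theta_hat` should land close to the generating vector regardless of the chosen scales; the scales only matter once the targets are noisy.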

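This paper proposes the first horizon-free policy search algorithm for episodic adversarial reinforcement learning. To address the challenges posed by exploration and adversarial rewards, it uses a variance-uncertainty-aware weighted least squares estimator and an occupancy-measure-based online search technique. The algorithm is shown to achieve an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret bound with full-information feedback, where $d$ is the dimension of the known feature mapping that linearly parametrizes the unknown transition kernel, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.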

Horizon-Free Reinforcement Learning in Adversarial Linear Mixture MDPs