This paper initiates the study of scale-free learning in Markov Decision
Processes (MDPs), where the scale of rewards/losses is unknown to the learner.
We design a generic algorithmic framework, \underline{S}cale
\underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this
framework in both the adversarial Multi-armed Bandit (MAB) setting and the
adversarial MDP setting. Through this framework, we achieve the first minimax
optimal expected regret bound and the first high-probability regret bound in
scale-free adversarial MABs, resolving an open problem raised in
\cite{hadiji2023adaptation}. In adversarial MDPs, our framework also yields
the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$
high-probability regret guarantee.

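In summary, this work studies scale-free learning in Markov decision processes. It proposes a generic algorithmic framework (\texttt{SCB}) and applies it to adversarial multi-armed bandits and adversarial Markov decision processes. This yields the first minimax optimal expected regret bound and the first high-probability regret bound for scale-free adversarial multi-armed bandits, as well as the first scale-free reinforcement learning algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.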