While quantum reinforcement learning (RL) has recently attracted a surge of
attention, its theoretical understanding remains limited. In particular, it remains
elusive how to design provably efficient quantum RL algorithms that can address
the exploration-exploitation trade-off. To this end, we propose a novel
UCRL-style algorithm that takes advantage of quantum computing for tabular
Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$,
and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case
regret bound for it, where $T$ is the number of episodes. Furthermore, we extend our
results to quantum RL with linear function approximation, which is capable of
handling problems with large state spaces. Specifically, we develop a quantum
algorithm based on value target regression (VTR) for linear mixture MDPs with
$d$-dimensional linear representation and prove that it enjoys
$\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants
of the classical UCRL and UCRL-VTR algorithms that additionally leverage a novel
combination of lazy updating mechanisms and quantum estimation subroutines.
This combination is the key to breaking the $\Omega(\sqrt{T})$-regret barrier
of classical RL. To the best of our knowledge, this is the first work to study
online exploration in quantum RL with provable logarithmic worst-case regret.
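
A rough heuristic for why faster estimation can turn $\sqrt{T}$ regret into
$\log T$ regret, offered as a sketch and not as the paper's actual analysis.
Assume a classical mean estimator built from $n$ samples has error on the order
of $1/\sqrt{n}$, while a quantum mean estimation subroutine using $n$ quantum
samples reaches error on the order of $1/n$. Assume further that the regret in
episode $t$ scales with the estimation error after roughly $t$ samples. Summing
over $T$ episodes then gives:

```latex
\sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 2\sqrt{T}
\qquad \text{versus} \qquad
\sum_{t=1}^{T} \frac{1}{t} \le 1 + \ln T .
```

The dependence on $S$, $A$, $H$, and $d$, as well as the role of lazy updating,
is deliberately left out of this heuristic.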

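We propose a novel quantum reinforcement learning algorithm and prove that, for tabular MDPs and linear mixture MDPs, its worst-case regret is polynomial in the problem parameters and $\log T$. This is the first study of online exploration in quantum RL with provable logarithmic worst-case regret.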

Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret

The performance of a reinforcement learning algorithm can vary drastically
during learning because of exploration. Existing algorithms provide little
information about the quality of their current policy before executing it, and
thus have limited use in high-stakes applications such as healthcare. We address
this lack of accountability by proposing that algorithms output policy
certificates. These certificates bound the sub-optimality and return of the
policy in the next episode, allowing humans to intervene when the certified
quality is not satisfactory. We further introduce two new algorithms with
certificates and present a new framework for theoretical analysis that
guarantees the quality of their policies and certificates. For tabular MDPs, we
show that computing certificates can even improve the sample-efficiency of
optimism-based exploration. As a result, one of our algorithms is the first to
achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm
also matches (and in some settings slightly improves upon) existing minimax
regret bounds.
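
The idea of a policy certificate can be sketched for a tabular MDP as follows.
This is a minimal illustration and not the paper's algorithm: the function name
`policy_certificate`, its arguments, and the Hoeffding-style confidence width
are all assumptions made for the example, and the width is not a validated
confidence bound. The sketch runs an optimistic and a pessimistic backward
induction on an estimated model, acts greedily with respect to the optimistic
values, and reports the gap between the two value estimates at the initial
state as the certificate.

```python
import math


def policy_certificate(P_hat, R_hat, counts, H, s0=0, delta=0.1):
    """Return (policy, certificate, lower, upper) for an estimated tabular MDP.

    P_hat[s][a][s2]: estimated transition probabilities
    R_hat[s][a]:     estimated mean rewards, assumed to lie in [0, 1]
    counts[s][a]:    number of observations of (s, a)
    """
    S, A = len(R_hat), len(R_hat[0])
    log_term = math.log(2 * S * A * H / delta)
    V_up = [0.0] * S  # optimistic values at step h + 1
    V_lo = [0.0] * S  # pessimistic values of the chosen policy at step h + 1
    policy = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        cap = float(H - h)  # at most one unit of reward per remaining step
        new_up, new_lo = [0.0] * S, [0.0] * S
        for s in range(S):
            best_a, best_up, best_lo = 0, -1.0, 0.0
            for a in range(A):
                # Hypothetical Hoeffding-style width, chosen for illustration.
                width = H * math.sqrt(log_term / (2 * max(counts[s][a], 1)))
                next_up = sum(p * v for p, v in zip(P_hat[s][a], V_up))
                next_lo = sum(p * v for p, v in zip(P_hat[s][a], V_lo))
                q_up = min(cap, max(0.0, R_hat[s][a] + next_up + width))
                q_lo = min(cap, max(0.0, R_hat[s][a] + next_lo - width))
                if q_up > best_up:  # greedy w.r.t. the optimistic estimate
                    best_a, best_up, best_lo = a, q_up, q_lo
            policy[h][s] = best_a
            new_up[s], new_lo[s] = best_up, best_lo
        V_up, V_lo = new_up, new_lo
    # The gap bounds the sub-optimality only if the confidence widths hold.
    return policy, V_up[s0] - V_lo[s0], V_lo[s0], V_up[s0]
```

Under these assumptions, the interval between the returned lower and upper
values plays the role of the return bound, and the gap plays the role of the
sub-optimality bound. A caller could, for example, decline to execute the
policy whenever the certificate exceeds a chosen threshold.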

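Proposes reinforcement learning algorithms that output policy certificates, which bound the sub-optimality and return of the policy in the next episode, together with a theoretical analysis framework that guarantees the quality of the algorithms' policies and certificates. One of the algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and it matches, or in some settings slightly improves upon, existing minimax regret bounds.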