This paper presents the first algorithm for model-based offline quantum
reinforcement learning and demonstrates its functionality on the cart-pole
benchmark. The model and the policy to be optimized are each implemented as
variational quantum circuits. The model is trained by gradient descent to fit a
pre-recorded data set. The policy is optimized with a gradient-free
optimization scheme using the return estimate given by the model as the fitness
function. This model-based approach allows, in principle, full realization on a
quantum computer during the optimization phase and gives hope that a quantum
advantage can be achieved as soon as sufficiently powerful quantum computers
are available.
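
The two-stage workflow described above (fit a model to recorded data by gradient descent, then tune a policy without gradients using the model's return estimate as the fitness) can be illustrated with a minimal classical sketch. Everything in it is an assumption made for illustration: a linear model and a threshold policy stand in for the variational quantum circuits, a hypothetical one-dimensional unstable system stands in for cart-pole, and simple random-search hill climbing stands in for the gradient-free scheme, which the text does not specify. All function names are hypothetical.

```python
import random

def true_step(s, a):
    # Hypothetical toy dynamics (unstable without control); NOT cart-pole.
    return 1.1 * s + 0.2 * a

def collect_dataset(n, rng):
    # Pre-recorded transitions (s, a, s') from a random behaviour policy.
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.choice([-1.0, 1.0])
        data.append((s, a, true_step(s, a)))
    return data

def train_model(data, epochs=500, lr=0.1):
    # Classical stand-in for the model circuit: fit s' ~ w_s*s + w_a*a + b
    # by full-batch gradient descent on the mean squared error.
    w_s, w_a, b = 0.0, 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g_s = g_a = g_b = 0.0
        for s, a, s_next in data:
            err = (w_s * s + w_a * a + b) - s_next
            g_s += 2.0 * err * s / n
            g_a += 2.0 * err * a / n
            g_b += 2.0 * err / n
        w_s -= lr * g_s
        w_a -= lr * g_a
        b -= lr * g_b
    return w_s, w_a, b

def policy(theta, s):
    # Classical stand-in for the policy circuit: a two-parameter threshold rule.
    return 1.0 if theta[0] * s + theta[1] >= 0.0 else -1.0

def model_return(theta, model, starts, horizon=20):
    # Fitness: average return of rollouts simulated with the learned model only.
    # Assumed reward: negative squared state at each step.
    w_s, w_a, b = model
    total = 0.0
    for s in starts:
        for _ in range(horizon):
            a = policy(theta, s)
            s = w_s * s + w_a * a + b
            total -= s * s
    return total / len(starts)

def optimize_policy(model, starts, rng, iters=200, sigma=0.5):
    # Gradient-free search: perturb the parameters, keep strict improvements.
    best = [0.0, 0.0]
    best_fit = model_return(best, model, starts)
    for _ in range(iters):
        cand = [p + rng.gauss(0.0, sigma) for p in best]
        fit = model_return(cand, model, starts)
        if fit > best_fit:
            best, best_fit = cand, fit
    return best, best_fit

rng = random.Random(0)
data = collect_dataset(200, rng)       # offline data, recorded once
model = train_model(data)              # stage 1: model fitting
starts = [-0.8, -0.4, 0.0, 0.4, 0.8]   # assumed initial states for rollouts
baseline = model_return([0.0, 0.0], model, starts)
theta, best_fit = optimize_policy(model, starts, rng)  # stage 2: policy search
print("learned model parameters:", model)
print("model-estimated return before and after:", baseline, best_fit)
```

Note that the policy search never calls `true_step`: once the data set is recorded, every fitness evaluation runs on the learned model alone, which mirrors the property that makes the optimization phase a candidate for running entirely on a quantum computer.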
