We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a carefully designed preconditioner together with a simple excitation mechanism. We show that the excitation signal causes the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic, restrictive assumptions on parameter sets that are often used in the literature.
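
The effect of excitation on the minimum eigenvalue can be illustrated with a small simulation. The sketch below is not the algorithm of the paper. It assumes a hypothetical scalar system $x_{t+1} = a x_t + b u_t + w_t$, a fixed feedback gain, and a preconditioner taken to be the regularized Gram matrix $\lambda I + \sum_t z_t z_t^\top$ with $z_t = (x_t, u_t)$; all names and numerical values are illustrative assumptions. Without excitation, every $z_t$ lies on a single line, so the minimum eigenvalue stays at $\lambda$; with Gaussian excitation added to the input, it grows with the number of steps.

```python
import math
import random


def min_eig_2x2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius


def simulate(excitation_std, steps=500, seed=0):
    """Track the minimum eigenvalue of lam*I + sum_t z_t z_t^T over time.

    All constants below are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    a_true, b_true = 0.9, 0.5  # assumed scalar dynamics x' = a*x + b*u + w
    gain = 0.8                 # assumed fixed feedback gain, u = -gain * x
    lam = 1.0                  # assumed regularization of the Gram matrix
    gram = [[lam, 0.0], [0.0, lam]]
    x = 0.0
    history = []
    for _ in range(steps):
        # Control input: linear feedback plus a Gaussian excitation signal.
        u = -gain * x + excitation_std * rng.gauss(0.0, 1.0)
        # Rank-one update with the regressor z = (x, u).
        gram[0][0] += x * x
        gram[0][1] += x * u
        gram[1][0] += x * u
        gram[1][1] += u * u
        history.append(min_eig_2x2(gram))
        # Advance the system with unit-variance process noise.
        x = a_true * x + b_true * u + rng.gauss(0.0, 1.0)
    return history


if __name__ == "__main__":
    passive = simulate(excitation_std=0.0)
    excited = simulate(excitation_std=1.0)
    print("min eigenvalue without excitation:", passive[-1])
    print("min eigenvalue with excitation:   ", excited[-1])
```

In this toy setting a larger minimum eigenvalue means the data are informative in every parameter direction, which is the property the abstract credits with accelerating the approximate posterior sampling.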

Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret