We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm,
called Thompson sampling with dynamic episodes (TSDE), uses two stopping
criteria to determine the lengths of the dynamic episodes in Thompson sampling.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix falls below half of its value at the start of the current
episode. We show under some conditions on the prior distribution that the
expected (Bayesian) regret of TSDE accumulated up to time T is bounded by
\tilde{O}(\sqrt{T}). Here \tilde{O}(\cdot) hides constants and logarithmic
factors. This is the first \tilde{O}(\sqrt{T}) bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
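
The two stopping criteria can be illustrated with a minimal sketch. The abstract does not give the exact rules, so the version below rests on assumptions: the first criterion is read as "the current episode may be at most one step longer than the previous one", and the second compares the current covariance determinant with its value at the start of the episode. The function name `should_stop_episode` and all of its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def should_stop_episode(t, t_k, prev_len, sigma_t, sigma_start):
    """Hypothetical check for ending the current episode at time t.

    t           -- current time step
    t_k         -- time step at which the current episode started
    prev_len    -- length of the previous episode (assumed bookkeeping)
    sigma_t     -- covariance matrix at time t
    sigma_start -- covariance matrix at the start of the current episode
    """
    # Criterion 1 (assumed form): cap the episode at one step longer than
    # the previous episode, which limits how fast episode lengths can grow.
    too_long = (t - t_k) > prev_len

    # Criterion 2: the determinant has dropped below half of its value at
    # the start of the episode. Log-determinants avoid underflow when the
    # covariance becomes very small.
    _, logdet_now = np.linalg.slogdet(sigma_t)
    _, logdet_start = np.linalg.slogdet(sigma_start)
    det_halved = logdet_now < logdet_start + np.log(0.5)

    return bool(too_long or det_halved)
```

Under these assumptions, a new parameter sample would be drawn whenever the function returns `True`, and the sampled parameters would stay fixed for the rest of the episode otherwise.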

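Summary: The paper introduces a Thompson sampling algorithm for the LQ control problem with unknown system parameters. The algorithm, called Thompson sampling with dynamic episodes (TSDE), uses two stopping criteria to determine the lengths of the dynamic episodes and attains an expected regret of O(\sqrt{T}). With an added reinitialization schedule, it is also shown to be robust to time variation in the model parameters.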