We provide the first known algorithm that provably achieves
$\varepsilon$-optimality within $\widetilde{\mathcal{O}}(1/\varepsilon)$
function evaluations for the discounted discrete-time LQR problem with unknown
parameters, without relying on two-point gradient estimates. These estimates
are known to be unrealistic in many settings, as they require evaluating two
different policies from the exact same initialization, even though that
initialization is drawn at random. Our results substantially improve upon the
existing literature outside the realm of two-point gradient estimates, which
either yields $\widetilde{\mathcal{O}}(1/\varepsilon^2)$ rates or relies
heavily on stability assumptions.
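
To make the distinction concrete, the following is a minimal sketch of a generic one-point zeroth-order gradient estimate for a discounted LQR cost. It is not the algorithm of the abstract above. The function names, the scalar system, the horizon, and the smoothing radius are all illustrative assumptions. Each cost evaluation draws its own random initial state, so no initialization is shared between two policies.

```python
import numpy as np

def discounted_cost(K, A, B, Q, R, gamma, horizon, rng):
    """Roll out u_t = -K x_t from a fresh random initial state and
    return the truncated discounted quadratic cost."""
    x = rng.standard_normal(A.shape[0])  # new initialization on every call
    cost = 0.0
    for t in range(horizon):
        u = -K @ x
        cost += gamma**t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return cost

def one_point_gradient(K, cost_fn, radius, rng):
    """One-point estimate (d / r) * f(K + r U) * U, with U drawn uniformly
    from the unit sphere in Frobenius norm. Uses a single cost evaluation."""
    U = rng.standard_normal(K.shape)
    U /= np.linalg.norm(U)
    return (K.size / radius) * cost_fn(K + radius * U) * U

# Hypothetical scalar system, chosen only for illustration.
rng = np.random.default_rng(0)
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
K = np.array([[0.5]])

cost_fn = lambda K_: discounted_cost(K_, A, B, Q, R, 0.9, 50, rng)
g = one_point_gradient(K, cost_fn, radius=0.1, rng=rng)
```

A two-point estimate would instead difference `cost_fn(K + radius * U)` and `cost_fn(K - radius * U)`, and its low variance depends on both rollouts starting from the same initial state.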

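We present a new algorithm that, without relying on two-point gradient estimates, guarantees $\varepsilon$-optimality within roughly $1/\varepsilon$ function evaluations for the discounted discrete-time LQR problem with unknown parameters.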

Sample Complexity of the Linear Quadratic Regulator: A Reinforcement Learning Lens

We address the fundamental limits of learning unknown parameters of any
stochastic process from time-series data, and discover exact closed-form
expressions for how optimal inference scales with observation length. Given a
parametrized class of candidate models, the Fisher information of observed
sequence probabilities lower-bounds the variance in model estimation from
finite data. As sequence-length increases, the minimal variance scales as the
square inverse of the length -- with constant coefficient given by the
information rate. We discover a simple closed-form expression for this
information rate, even in the case of infinite Markov order. We furthermore
obtain the exact analytic lower bound on model variance from the
observation-induced metadynamic among belief states. We discover ephemeral,
exponential, and more general modes of convergence to the asymptotic
information rate. Surprisingly, this myopic information rate converges to the
asymptotic Fisher information rate with exactly the same relaxation timescales
that appear in the myopic entropy rate as it converges to the Shannon entropy
rate for the process. We illustrate these results with a sequence of examples
that highlight qualitatively distinct features of stochastic processes that
shape optimal learning.
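
As a rough illustration of the central quantity, the sketch below computes the Fisher information of length-$L$ sequence probabilities by brute-force enumeration for a toy two-state Markov chain. The toy model, the function names, and the finite-difference derivative are assumptions made for this example. Reading the increment per added symbol as a "myopic" information rate is likewise an interpretation, not the paper's closed-form result.

```python
from itertools import product
import numpy as np

def fisher_information(prob_fn, theta, length, h=1e-5):
    """F_L(theta) = sum over binary words w of (dP(w)/dtheta)^2 / P(w),
    with the derivative approximated by a central finite difference."""
    total = 0.0
    for w in product((0, 1), repeat=length):
        p = prob_fn(w, theta)
        if p > 0.0:
            dp = (prob_fn(w, theta + h) - prob_fn(w, theta - h)) / (2 * h)
            total += dp * dp / p
    return total

def markov_word_prob(w, theta, q=0.5):
    """Hypothetical toy model: stationary two-state chain with
    P(0 -> 1) = theta and P(1 -> 0) = q."""
    T = np.array([[1 - theta, theta], [q, 1 - q]])
    pi = np.array([q, theta]) / (theta + q)  # stationary distribution
    p = pi[w[0]]
    for a, b in zip(w, w[1:]):
        p *= T[a, b]
    return p

# Fisher information for sequence lengths 1..6 at theta = 0.3.
F = [fisher_information(markov_word_prob, 0.3, L) for L in range(1, 7)]
# Information gained per additional observed symbol.
increments = np.diff(F)
```

Enumeration costs $2^L$ evaluations, so this is only practical for short sequences. Its purpose is to show what is being summed, not to compute the rate efficiently.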

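We study the fundamental limits of learning unknown parameters from time-series data, and obtain closed-form expressions for how optimal inference scales with observation length.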

Ultimate limit on learning non-Markovian behavior: Fisher information rate and excess information

We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm is
called Thompson sampling with dynamic episodes (TSDE) where two stopping
criteria determine the lengths of the dynamic episodes in Thompson sampling.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix is less than half of the previous value. We show under some
conditions on the prior distribution that the expected (Bayesian) regret of
TSDE accumulated up to time $T$ is bounded by $\tilde{O}(\sqrt{T})$. Here
$\tilde{O}(\cdot)$ hides constants and logarithmic factors. This is the first
$\tilde{O}(\sqrt{T})$ bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
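
The two stopping criteria can be sketched as a single predicate. This is a minimal sketch under stated assumptions. The abstract only says that the first criterion controls the growth rate of the episode length, so the specific rule used here (end the episode once it has run longer than the previous one) is an assumption. "Previous value" is read as the determinant at the start of the current episode, which is also an assumption. All names are hypothetical.

```python
import numpy as np

def episode_should_end(t, episode_start, prev_episode_len, cov_now, cov_at_start):
    """Return True when either stopping criterion fires.

    Criterion 1 (assumed form): the current episode has run longer than
    the previous episode, which limits how fast episode lengths can grow.
    Criterion 2: the determinant of the sample covariance matrix has
    dropped below half of its value at the start of the episode.
    """
    too_long = (t - episode_start) > prev_episode_len
    det_halved = np.linalg.det(cov_now) < 0.5 * np.linalg.det(cov_at_start)
    return bool(too_long or det_halved)
```

In a Thompson sampling loop, a new parameter sample would be drawn from the posterior, and the controller recomputed, whenever this predicate returns `True`.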

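We introduce a Thompson sampling algorithm for the LQ control problem with unknown system parameters, called Thompson sampling with dynamic episodes (TSDE). Two stopping criteria determine the lengths of the dynamic episodes, the expected regret is of order $\sqrt{T}$ up to constants and logarithmic factors, and adding a reinitialization schedule makes the algorithm robust to time-varying model parameters.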