Incorporating expert demonstrations has empirically helped to improve the
sample efficiency of reinforcement learning (RL). This paper quantifies
theoretically to what extent this extra information reduces RL's sample
complexity. In particular, we study demonstration-regularized reinforcement
learning, which leverages expert demonstrations through KL-regularization toward a
policy learned by behavior cloning. Our findings reveal that using
$N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal
policy at a sample complexity of order
$\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$
in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2
N^{\mathrm{E}}))$ in linear Markov decision processes (MDPs), where $\varepsilon$ is
the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number
of states in the finite case, and $d$ the dimension of the feature space in the
linear case. As a by-product, we provide tight convergence guarantees for the
behavior cloning procedure under general assumptions on the policy classes.
Additionally, we establish that demonstration-regularized methods are provably
efficient for reinforcement learning from human feedback (RLHF). In this
respect, we provide theoretical evidence showing the benefits of
KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid
pessimism injection by employing computationally feasible regularization to
handle reward estimation uncertainty, which sets our approach apart from
prior work.
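
To make the role of KL-regularization concrete, the following is a minimal one-step sketch, not the paper's algorithm. It assumes a single state with known action values and a fixed behavior-cloning policy, and uses the standard closed form for the maximizer of expected value minus a KL penalty. The names `kl_regularized_policy`, `kl_divergence`, `q_values`, `bc_policy`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def kl_regularized_policy(q_values, bc_policy, lam):
    """Maximize sum_a pi(a) * q(a) - lam * KL(pi || bc_policy) over distributions pi.

    The closed-form maximizer is pi(a) proportional to bc_policy(a) * exp(q(a) / lam).
    """
    # Subtract the largest action value before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [p * math.exp((q - q_max) / lam) for q, p in zip(q_values, bc_policy)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q(a) > 0 wherever p(a) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical action values and behavior-cloning policy for one state.
q_values = [1.0, 0.5, 0.0]
bc_policy = [0.2, 0.7, 0.1]

# Strong regularization keeps the policy close to the behavior-cloning policy.
strong = kl_regularized_policy(q_values, bc_policy, lam=100.0)
# Weak regularization lets the policy concentrate on the highest-value action.
weak = kl_regularized_policy(q_values, bc_policy, lam=0.01)

print("strong regularization:", [round(p, 3) for p in strong])
print("weak regularization:  ", [round(p, 3) for p in weak])
```

In this sketch the coefficient `lam` interpolates between imitating the demonstrations (large `lam`) and acting greedily on the value estimates (small `lam`).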

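By using expert demonstrations to improve the sample efficiency of reinforcement learning, this work quantifies theoretically how much the extra information reduces sample complexity, and proves the benefits of KL-regularization for reinforcement learning from human feedback.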

Demonstration-Regularized Reinforcement Learning

Demonstration-Regularized RL

When assisting human users in reinforcement learning (RL), we can represent
users as RL agents and study key parameters, called \emph{user traits}, to
inform intervention design. We study the relationship between user behaviors
(policy classes) and user traits. Given an environment, we introduce an
intuitive tool for studying the breakdown of "user types": broad sets of traits
that result in the same behavior. We show that seemingly different real-world
environments admit the same set of user types and formalize this observation as
an equivalence relation defined on environments. By transferring intervention
design between environments within the same equivalence class, we can help
rapidly personalize interventions.
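
The idea of user types can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the single user trait is a discount factor `gamma`, the environment offers a small immediate reward or a larger delayed one, and the helper names `preferred_action` and `user_types` are hypothetical. Traits that lead to the same behavior are grouped into one type.

```python
from collections import defaultdict


def preferred_action(gamma, r_near=1.0, r_far=4.0, delay=3):
    """Behavior of a user with discount factor gamma in a toy two-option task.

    "near" pays r_near immediately; "far" pays r_far after `delay` steps.
    """
    value_near = r_near
    value_far = (gamma ** delay) * r_far
    return "far" if value_far > value_near else "near"


def user_types(gammas, **env):
    """Group trait values by the behavior they induce in the given environment."""
    types = defaultdict(list)
    for gamma in gammas:
        types[preferred_action(gamma, **env)].append(gamma)
    return dict(types)


# Hypothetical grid of discount factors 0.1, 0.2, ..., 0.9.
gammas = [i / 10 for i in range(1, 10)]

# Two environments with different reward scales.
types_env_a = user_types(gammas, r_near=1.0, r_far=4.0, delay=3)
types_env_b = user_types(gammas, r_near=2.0, r_far=8.0, delay=3)

print(types_env_a)
print("same user types in both environments:", types_env_a == types_env_b)
```

The two toy environments differ in their rewards yet split the trait grid identically, which is the kind of situation the equivalence relation on environments is meant to capture.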

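For the setting of assisting human users in reinforcement learning, this work studies key parameters, called "user traits", to inform intervention design, and examines the relationship between user behaviors (policy classes) and user traits. It introduces an intuitive tool for studying the breakdown of "user types", shows that seemingly different real-world environments admit the same user types, and formalizes this as an equivalence relation defined on environments. Transferring intervention designs between environments in the same equivalence class can help personalize interventions rapidly.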

Discovering User Types: Mapping User Traits via Task-Specific Behaviors in Reinforcement Learning

Discovering User Types: Mapping User Traits by Task-Specific Behaviors in Reinforcement Learning

Recent progress in model selection raises the question of the fundamental
limits of these techniques. Under specific scrutiny has been model selection
for general contextual bandits with nested policy classes, resulting in a
COLT 2020 open problem. It asks whether it is possible to obtain simultaneously
the optimal single-algorithm guarantees over all policies in a nested sequence
of policy classes, or, failing that, whether this is possible for a trade-off
$\alpha\in[\frac{1}{2},1)$ between complexity term and time:
$\ln(|\Pi_m|)^{1-\alpha}T^\alpha$. We give a disappointing answer to this
question. Even in the purely stochastic regime, the desired results are
unobtainable. We present a Pareto frontier of upper and lower bounds that match
up to logarithmic factors, thereby proving that an increase in the
complexity term $\ln(|\Pi_m|)$ independent of $T$ is unavoidable for general
policy classes. As a side result, we also resolve a COLT 2016 open problem
concerning second-order bounds in full-information games.
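
As a rough numerical illustration of the trade-off term $\ln(|\Pi_m|)^{1-\alpha}T^\alpha$, the sketch below evaluates it for a few values of $\alpha$. Constants and logarithmic factors are dropped, and the function name `model_selection_rate` and the example values are hypothetical, chosen only to show how the rate changes with $\alpha$.

```python
import math


def model_selection_rate(log_class_size, horizon, alpha):
    """Evaluate ln(|Pi_m|)^(1 - alpha) * T^alpha, ignoring constants and log factors."""
    if not 0.5 <= alpha < 1.0:
        raise ValueError("alpha must lie in [1/2, 1)")
    return (log_class_size ** (1.0 - alpha)) * (horizon ** alpha)


# Hypothetical policy class of size 1024 and time horizon T = 10,000.
log_class_size = math.log(1024)
horizon = 10_000

rates = {
    alpha: model_selection_rate(log_class_size, horizon, alpha)
    for alpha in (0.5, 0.75, 0.9)
}

for alpha, rate in rates.items():
    print(f"alpha = {alpha}: rate = {rate:.1f}")
```

At $\alpha = \frac{1}{2}$ the expression reduces to $\sqrt{\ln(|\Pi_m|)\,T}$, the single-class rate; larger $\alpha$ weakens the dependence on the class size at the cost of a worse dependence on $T$.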

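This work studies the limits of model selection and proves that, for nested policy classes, the optimal single-algorithm guarantees cannot be obtained simultaneously for all policies under any trade-off between the time and complexity terms, and that the desired results are unobtainable even in the purely stochastic regime. It also resolves an open problem on full-information games.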