A key challenge for a reinforcement learning (RL) agent is to incorporate
external/expert advice in its learning. The desired goals of an algorithm that
can shape the learning of an RL agent with external advice include (a)
maintaining policy invariance; (b) accelerating the learning of the agent; and
(c) learning from arbitrary advice [3]. To address this challenge, this paper
formulates the problem of incorporating external advice in RL as a multi-armed
bandit called shaping-bandits. The reward of each arm of shaping-bandits
corresponds to the return obtained by following the expert or by following a
default RL algorithm learning on the true environment reward. We show that
directly applying existing bandit and shaping algorithms that do not reason
about the non-stationary nature of the underlying returns can lead to poor
results. Thus, we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES
(LPIES), three different shaping algorithms built on different assumptions that
reason about the long-term consequences of following the expert policy or the
default RL algorithm. Our experiments in four different settings show that
these proposed algorithms achieve the above-mentioned goals whereas the other
algorithms fail to do so.
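
The bandit formulation above can be illustrated with a minimal sketch. It uses the standard UCB1 index over two arms, "follow the expert" and "follow the default RL algorithm", and is not UPIES, RPIES, or LPIES: the abstract notes that stationary bandit algorithms applied directly can perform poorly because the underlying returns are non-stationary. The class name `UCB1`, the arm labels, and the fixed returns in `fake_returns` are assumptions made only for this example.

```python
import math


class UCB1:
    """Standard UCB1 over a fixed set of arms (a stationary baseline, not PIES)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}   # pulls per arm
        self.means = {a: 0.0 for a in self.arms}  # running mean return per arm
        self.t = 0                                # total pulls so far

    def select(self):
        # Pull every arm once before using the confidence index.
        for a in self.arms:
            if self.counts[a] == 0:
                return a

        def index(a):
            bonus = math.sqrt(2.0 * math.log(self.t) / self.counts[a])
            return self.means[a] + bonus

        return max(self.arms, key=index)

    def update(self, arm, episode_return):
        self.t += 1
        self.counts[arm] += 1
        # Incremental mean update.
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]


# Hypothetical, stationary returns chosen only to exercise the code.
fake_returns = {"expert": 0.8, "default_rl": 0.3}

bandit = UCB1(["expert", "default_rl"])
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, fake_returns[arm])
```

In a real shaping problem the return of the default RL arm changes as the agent learns, which is the non-stationarity the proposed algorithms are designed to reason about and this baseline ignores.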

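This paper formulates the incorporation of external advice into an RL agent's learning as a multi-armed bandit problem called shaping-bandits, and proposes three shaping algorithms that reason about the long-term consequences of following the expert policy or the default RL algorithm. Experiments in four different settings show that these algorithms achieve the stated goals.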

Bandit-based policy-invariant explicit shaping for incorporating external advice in reinforcement learning

Bandit-Based Policy Invariant Explicit Shaping for Incorporating External Advice in Reinforcement Learning

Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a
reward function from expert demonstrations. Many IRL algorithms require a known
transition model and sometimes even a known expert policy, or they at least
require access to a generative model. However, these assumptions are too strong
for many real-world applications, where the environment can be accessed only
through sequential interaction. We propose a novel IRL algorithm: Active
exploration for Inverse Reinforcement Learning (AceIRL), which actively
explores an unknown environment and expert policy to quickly learn the expert's
reward function and identify a good policy. AceIRL uses previous observations
to construct confidence intervals that capture plausible reward functions and
find exploration policies that focus on the most informative regions of the
environment. AceIRL is the first approach to active IRL with sample-complexity
bounds that does not require a generative model of the environment. AceIRL
matches the sample complexity of active IRL with a generative model in the
worst case. Additionally, we establish a problem-dependent bound that relates
the sample complexity of AceIRL to the suboptimality gap of a given IRL
problem. We empirically evaluate AceIRL in simulations and find that it
significantly outperforms more naive exploration strategies.
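
The idea of steering exploration toward the least certain parts of the environment can be sketched as follows. This is not AceIRL's confidence-interval construction, which the abstract does not specify; it swaps in a generic Hoeffding-style width based on visit counts and picks the (state, action) pair with the widest interval. The function names, the `delta` parameter, and the visit counts are assumptions made for illustration.

```python
import math
from collections import Counter


def confidence_width(count, delta=0.05):
    """Hoeffding-style half-width for a [0, 1]-bounded quantity seen `count` times."""
    if count == 0:
        return 1.0  # nothing observed yet: maximal uncertainty
    return min(1.0, math.sqrt(math.log(2.0 / delta) / (2.0 * count)))


def most_uncertain(pairs, counts, delta=0.05):
    """Return the (state, action) pair whose estimate is least certain."""
    return max(pairs, key=lambda p: confidence_width(counts[p], delta))


# Hypothetical visit counts for a 3-state, 2-action problem.
pairs = [(s, a) for s in range(3) for a in range(2)]
counts = Counter({(0, 0): 50, (0, 1): 40, (1, 0): 10,
                  (1, 1): 8, (2, 0): 1, (2, 1): 3})

target = most_uncertain(pairs, counts)
```

An exploration policy built on this heuristic would try to reach `target` next; a naive strategy such as uniform random exploration ignores the counts entirely.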

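This paper proposes AceIRL, an inverse reinforcement learning algorithm that uses active exploration. It constructs confidence intervals that capture plausible reward functions and finds exploration policies that focus on the most informative regions of the environment, so that it can quickly learn the expert's reward function and identify a good policy. AceIRL is the first active IRL approach with sample-complexity bounds that does not require a generative model of the environment, and it matches the sample complexity of active IRL with a generative model. In simulations, AceIRL outperforms other exploration strategies.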

Active exploration for inverse reinforcement learning

Active Exploration for Inverse Reinforcement Learning

Imitation learning (IL) aims to mimic the behavior of an expert policy in a
sequential decision-making problem given only demonstrations. In this paper, we
focus on understanding the minimax statistical limits of IL in episodic Markov
Decision Processes (MDPs). We first consider the setting where the learner is
provided a dataset of $N$ expert trajectories ahead of time, and cannot
interact with the MDP. Here, we show that the policy which mimics the expert
whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log
(N)}{N}$ suboptimal compared to the value of the expert, even when the expert
follows an arbitrary stochastic policy. Here $\mathcal{S}$ is the state space,
and $H$ is the length of the episode. Furthermore, we establish a suboptimality
lower bound of $\gtrsim |\mathcal{S}| H^2 / N$ which applies even if the expert
is constrained to be deterministic, or if the learner is allowed to actively
query the expert at visited states while interacting with the MDP for $N$
episodes. To our knowledge, this is the first algorithm with suboptimality
having no dependence on the number of actions, under no additional assumptions.
We then propose a novel algorithm based on minimum-distance functionals in the
setting where the transition model is given and the expert is deterministic.
The algorithm is suboptimal by $\lesssim \min \{ H \sqrt{|\mathcal{S}| / N} ,\
|\mathcal{S}| H^{3/2} / N \}$, showing that knowledge of the transition model
improves the minimax rate by at least a $\sqrt{H}$ factor.
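
The "mimic the expert whenever possible" policy from the no-interaction setting can be sketched for a tabular episodic MDP as below. This is an illustrative reading of the abstract, not the authors' implementation: the trajectory format (a list of `(state, action)` pairs indexed by step), the function names, and the tiny dataset are assumptions. At a (step, state) pair seen in the data, the policy samples from the empirical expert action distribution; elsewhere it acts arbitrarily.

```python
import random
from collections import Counter, defaultdict


def fit_mimic_policy(trajectories):
    """Count expert actions at every (step, state) pair seen in the dataset."""
    table = defaultdict(Counter)
    for traj in trajectories:
        for h, (state, action) in enumerate(traj):
            table[(h, state)][action] += 1
    return table


def act(table, h, state, actions, rng=random):
    """Mimic the expert where data exists; otherwise pick any action."""
    counts = table.get((h, state))
    if not counts:
        return rng.choice(actions)  # unseen (step, state): arbitrary action
    seen, weights = zip(*counts.items())
    # Sample from the empirical expert distribution (covers stochastic experts).
    return rng.choices(seen, weights=weights, k=1)[0]


# Hypothetical dataset of N = 2 expert trajectories with horizon H = 2.
demos = [
    [("s0", "a"), ("s1", "b")],
    [("s0", "a"), ("s2", "a")],
]
table = fit_mimic_policy(demos)
```

The table has at most $|\mathcal{S}| H$ entries regardless of how many actions exist, which is consistent with the stated bound having no dependence on the number of actions.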

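This work studies the suboptimality of algorithms that imitate an expert policy in Markov decision processes when only a fixed dataset of demonstrations is available, and proposes a new algorithm based on minimum-distance functionals that improves the minimax rate when the expert is deterministic and the transition model is known.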