While reinforcement learning (RL) has made great advances in scalability,
exploration and partial observability are still active research topics. In
contrast, Bayesian RL (BRL) provides a principled answer to both state
estimation and the exploration-exploitation trade-off, but struggles to scale.
To tackle this challenge, BRL frameworks with various prior assumptions have
been proposed, with varied success. This work presents a
representation-agnostic formulation of BRL under partial observability,
unifying the previous models under one theoretical umbrella. To demonstrate its
practical significance, we also propose a novel derivation, Bayes-Adaptive Deep
Dropout RL (BADDr), based on dropout networks. Under this parameterization, in
contrast to previous work, the belief over the state and dynamics is a more
scalable inference problem. We choose actions through Monte-Carlo tree search
and empirically show that our method is competitive with state-of-the-art BRL
methods on small domains while being able to solve much larger ones.
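
The idea of using a dropout network as a belief over dynamics can be illustrated with a minimal sketch. This is not BADDr's architecture, which the abstract does not specify. It assumes a hypothetical one-hidden-layer network with random (untrained) weights, a hypothetical three-dimensional (state, action) encoding, and a keep probability of 0.8. Each sampled dropout mask is treated as one hypothesis about the dynamics, so a set of masks acts like a set of particles.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mask(n_hidden, keep_prob, rng):
    """Draw one dropout mask; here one mask stands for one dynamics hypothesis."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n_hidden)]


def predict_next_state_probs(x, w_in, w_out, mask, keep_prob):
    """Forward pass of a one-hidden-layer ReLU network under a fixed mask."""
    hidden = []
    for j, row in enumerate(w_in):
        pre = sum(w * xi for w, xi in zip(row, x))
        # Inverted dropout: rescale kept units by 1 / keep_prob.
        hidden.append(max(0.0, pre) * mask[j] / keep_prob)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


# Hypothetical sizes and random weights (assumptions, not from the paper).
rng = random.Random(0)
n_in, n_hidden, n_states, keep_prob = 3, 8, 4, 0.8
w_in = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_states)]

x = [1.0, 0.0, 1.0]  # hypothetical (state, action) encoding
masks = [sample_mask(n_hidden, keep_prob, rng) for _ in range(50)]
samples = [predict_next_state_probs(x, w_in, w_out, m, keep_prob) for m in masks]

# Averaging over masks approximates the posterior-predictive next-state distribution.
mean_probs = [sum(s[k] for s in samples) / len(samples) for k in range(n_states)]
```

The spread of `samples` around `mean_probs` is what a planner could use as a proxy for uncertainty about the dynamics; in a trained model, that spread would shrink where data is plentiful.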

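This paper proposes a representation-agnostic theoretical framework for Bayesian reinforcement learning under partial observability, together with BADDr, a new method based on dropout networks. It aims to address the scalability bottleneck of BRL methods and demonstrates the method's effectiveness on larger problems.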

BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs

Consider the following instance of the Offline Meta Reinforcement Learning
(OMRL) problem: given the complete training logs of $N$ conventional RL agents,
trained on $N$ different tasks, design a meta-agent that can quickly maximize
reward in a new, unseen task from the same task distribution. In particular,
while each conventional RL agent explored and exploited its own different task,
the meta-agent must identify regularities in the data that lead to effective
exploration/exploitation in the unseen task. Here, we take a Bayesian RL (BRL)
view, and seek to learn a Bayes-optimal policy from the offline data. Building
on the recent VariBAD BRL approach, we develop an off-policy BRL method that
learns to plan an exploration strategy based on an adaptive neural belief
estimate. However, learning to infer such a belief from offline data brings a
new identifiability issue we term MDP ambiguity. We characterize the problem,
and suggest resolutions via data collection and modification procedures.
Finally, we evaluate our framework on a diverse set of domains, including
difficult sparse reward tasks, and demonstrate learning of effective
exploration behavior that is qualitatively different from the exploration used
by any RL agent in the data.

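Taking a Bayesian RL view, this paper formulates the Offline Meta Reinforcement Learning problem over offline data: how to design a meta-agent that quickly maximizes reward in new tasks drawn from the same task distribution. It examines exploration strategies, MDP ambiguity, and sparse-reward tasks, and ultimately obtains exploration behavior that goes beyond that of any single RL agent in the offline data.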

Offline Meta Learning of Exploration

The explore-exploit dilemma is one of the central challenges in Reinforcement
Learning (RL). Bayesian RL solves the dilemma by providing the agent with
information in the form of a prior distribution over environments; however,
full Bayesian planning is intractable. Planning with the mean MDP is a common
myopic approximation of Bayesian planning. We derive a novel reward bonus that
is a function of the posterior distribution over environments, which, when
added to the reward in planning with the mean MDP, results in an agent which
explores efficiently and effectively. Although our method is similar to
existing methods when given an uninformative or unstructured prior, unlike
existing methods, our method can exploit structured priors. We prove that our
method results in a polynomial sample complexity and empirically demonstrate
its advantages in a structured exploration task.
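
Planning with the mean MDP plus a posterior-dependent reward bonus can be sketched as follows. The abstract does not give the bonus formula, so this sketch assumes a Dirichlet posterior over transitions (represented by hypothetical `counts`) and a stand-in bonus equal to `beta` times the summed posterior standard deviations of the transition probabilities. It should be read as an illustration of the general recipe, not as the paper's bonus.

```python
import math


def dirichlet_mean(alpha):
    """Posterior mean of a Dirichlet with parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_std_sum(alpha):
    """Sum of posterior standard deviations of each transition probability."""
    a0 = sum(alpha)
    return sum(math.sqrt(a * (a0 - a) / (a0 * a0 * (a0 + 1))) for a in alpha)


def q_values(s, values, counts, rewards, beta, gamma):
    """Q-values in the mean MDP with the (assumed) uncertainty bonus added."""
    qs = []
    for a in range(len(counts[s])):
        p = dirichlet_mean(counts[s][a])
        bonus = beta * dirichlet_std_sum(counts[s][a])  # hypothetical bonus
        expected_next = sum(pk * v for pk, v in zip(p, values))
        qs.append(rewards[s][a] + bonus + gamma * expected_next)
    return qs


def plan_mean_mdp(counts, rewards, beta, gamma=0.95, iters=200):
    """Value iteration on the mean MDP; returns (values, greedy policy)."""
    n_states = len(counts)
    values = [0.0] * n_states
    for _ in range(iters):
        values = [max(q_values(s, values, counts, rewards, beta, gamma))
                  for s in range(n_states)]
    policy = []
    for s in range(n_states):
        qs = q_values(s, values, counts, rewards, beta, gamma)
        policy.append(max(range(len(qs)), key=qs.__getitem__))
    return values, policy


# Two states, two actions: action 0 is well explored, action 1 is not.
counts = [[[50.0, 50.0], [1.0, 1.0]],
          [[50.0, 50.0], [1.0, 1.0]]]
rewards = [[1.0, 1.0], [1.0, 1.0]]

_, policy_plain = plan_mean_mdp(counts, rewards, beta=0.0)
_, policy_bonus = plan_mean_mdp(counts, rewards, beta=1.0)
```

In this toy setup both actions have identical mean dynamics and rewards, so the plain mean-MDP planner has no reason to try the under-explored action, while the planner with the bonus prefers it. The bonus shrinks as counts grow, which is the qualitative behavior a posterior-based bonus is meant to provide.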

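This paper proposes a reward bonus based on the posterior distribution to address the explore-exploit dilemma in Bayesian RL. The resulting agent explores efficiently and effectively, can exploit structured prior knowledge, and is proven to have polynomial sample complexity.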