Model-based offline reinforcement learning (RL), which builds a supervised
transition model from a logged dataset to avoid costly interactions with the
online environment, has been a promising approach for offline policy
optimization. As the discrepancy between the logged data and the online
environment may result in a distributional shift problem, many prior works have
studied how to build robust transition models conservatively and estimate the
model uncertainty accurately. However, over-conservatism can limit the
exploration of the agent, and the uncertainty estimates may be unreliable. In
this work, we propose a novel Model-based Offline policy optimization framework
with Adversarial Network (MOAN). The key idea is to use adversarial learning to
build a transition model with better generalization, where an adversary is
introduced to distinguish between in-distribution and out-of-distribution
samples. Moreover, the adversary can naturally provide a quantification of the
model's uncertainty with theoretical guarantees. Extensive experiments show
that our approach outperforms existing state-of-the-art baselines on widely
studied offline RL benchmarks. It can also generate diverse in-distribution
samples and quantify the uncertainty more accurately.
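
The idea of letting a discriminator double as an uncertainty signal can be illustrated with a minimal sketch. It is not MOAN's actual formulation, which this abstract does not spell out, and it does not train the transition model adversarially. It only fits a one-dimensional logistic discriminator that separates logged samples from model-generated ones, then subtracts `1 - D(x)` from the reward. The scalar feature `x`, the helper names `train_discriminator` and `penalized_reward`, and the weight `lam` are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_discriminator(in_dist, out_dist, lr=0.5, steps=2000):
    """Fit D(x) = sigmoid(w * x + b) with in_dist -> 1 and out_dist -> 0."""
    data = [(x, 1.0) for x in in_dist] + [(x, 0.0) for x in out_dist]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def penalized_reward(reward, x, w, b, lam=1.0):
    # Hypothetical penalty: the less in-distribution a sample looks,
    # the more is deducted from the model-predicted reward.
    uncertainty = 1.0 - sigmoid(w * x + b)
    return reward - lam * uncertainty


# Toy scalar features, made up for illustration (not taken from any benchmark).
logged = [0.0, 0.1, 0.2, 0.3]
generated = [1.0, 1.2, 1.5, 2.0]
w, b = train_discriminator(logged, generated)
print(penalized_reward(1.0, 0.1, w, b), penalized_reward(1.0, 1.5, w, b))
```

In this toy setup a sample that resembles the logged data keeps most of its reward, while a sample far from it is penalized more heavily, which is the qualitative behaviour an uncertainty-aware offline method aims for.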

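Uses adversarial learning to build a transition model that generalizes better, quantifies model uncertainty more accurately, and outperforms existing state-of-the-art baselines on widely studied offline RL benchmarks.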

Model-Based Offline Policy Optimization with an Adversarial Network

Model-based Offline Policy Optimization with Adversarial Network

Reinforcement learning algorithms usually assume that all actions are always
available to an agent. However, both people and animals understand the general
link between the features of their environment and the actions that are
feasible. Gibson (1977) coined the term "affordances" to describe the fact that
certain states enable an agent to do certain actions, in the context of
embodied agents. In this paper, we develop a theory of affordances for agents
that learn and plan in Markov Decision Processes. Affordances play a dual role
in this case. On the one hand, they allow faster planning by reducing the number
of actions available in any given situation. On the other hand, they facilitate
more efficient and precise learning of transition models from data, especially
when such models require function approximation. We establish these properties
through theoretical results as well as illustrative examples. We also propose
an approach to learn affordances and use it to estimate transition models that
are simpler and generalize better.
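
The planning benefit can be made concrete with a small sketch: value iteration on a five-state chain, run once over all actions and once over only the actions that an affordance predicate permits. The chain, the reward, and the predicate `no_wall_bumps` are hypothetical stand-ins chosen for illustration. The paper's formal definition of affordances is not given in this abstract, and here the predicate is written by hand rather than learned.

```python
def value_iteration(n_states, actions, step, reward, afforded, gamma=0.9, sweeps=50):
    """Synchronous value iteration that only backs up afforded (state, action) pairs."""
    values = [0.0] * n_states
    backups = 0
    for _ in range(sweeps):
        new_values = []
        for s in range(n_states):
            best = float("-inf")
            for a in actions:
                if not afforded(s, a):
                    continue  # pruned by the affordance: no backup is computed
                backups += 1
                s_next = step(s, a)
                best = max(best, reward(s_next) + gamma * values[s_next])
            new_values.append(best)
        values = new_values
    return values, backups


N = 5                 # states 0..4 on a line
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def step(s, a):
    # Deterministic move; walking into a wall leaves the state unchanged.
    return min(max(s + a, 0), N - 1)


def reward(s_next):
    # Reward for being in the right-most state after the move.
    return 1.0 if s_next == N - 1 else 0.0


def all_actions(s, a):
    return True


def no_wall_bumps(s, a):
    # Hypothetical affordance: an action is feasible only if it stays on the chain.
    return 0 <= s + a <= N - 1


v_full, n_full = value_iteration(N, ACTIONS, step, reward, all_actions)
v_aff, n_aff = value_iteration(N, ACTIONS, step, reward, no_wall_bumps)
print(n_full, n_aff, v_full == v_aff)
```

Because bumping into a wall is equivalent to staying put in this toy chain, pruning those actions leaves the values unchanged while saving two backups per sweep. In larger action spaces the saving grows with the number of infeasible actions.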

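This paper proposes a theory of affordances for Markov Decision Processes. Affordances speed up planning and make learning more efficient and precise, especially for models that require function approximation. The paper also presents a method for learning affordances and uses it to estimate transition models that are simpler and generalize better.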

A Theory of Affordances in Reinforcement Learning

What can I do here? A Theory of Affordances in Reinforcement Learning

Combining deep model-free reinforcement learning with on-line planning is a
promising approach to building on the successes of deep RL. On-line planning
with look-ahead trees has proven successful in environments where transition
models are known a priori. However, in complex environments where transition
models need to be learned from data, the deficiencies of learned models have
limited their utility for planning. To address these challenges, we propose
TreeQN, a differentiable, recursive, tree-structured model that serves as a
drop-in replacement for any value function network in deep RL with discrete
actions. TreeQN dynamically constructs a tree by recursively applying a
transition model in a learned abstract state space and then aggregating
predicted rewards and state-values using a tree backup to estimate Q-values. We
also propose ATreeC, an actor-critic variant that augments TreeQN with a
softmax layer to form a stochastic policy network. Both approaches are trained
end-to-end, such that the learned model is optimised for its actual use in the
tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a
box-pushing task, as well as n-step DQN and value prediction networks (Oh et
al. 2017) on multiple Atari games. Furthermore, we present ablation studies
that demonstrate the effect of different auxiliary losses on learning
transition models.
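
The recursive expand-then-back-up structure described above can be sketched in a few lines. This is a simplified illustration, not the TreeQN architecture itself: the `transition`, `reward`, and `value` callables stand in for learned networks, the abstract state is a plain integer, and a hard max is assumed as the backup operator because the abstract does not specify the exact backup rule.

```python
def tree_q_values(z, actions, transition, reward, value, depth, gamma=0.99):
    """Estimate Q(z, a) for every action with a depth-limited look-ahead tree."""
    q = {}
    for a in actions:
        z_next = transition(z, a)  # one step in the abstract state space
        if depth == 1:
            backup = value(z_next)  # leaf: fall back on the state-value estimate
        else:
            child_q = tree_q_values(
                z_next, actions, transition, reward, value, depth - 1, gamma
            )
            backup = max(child_q.values())  # assumed max backup over child Q-values
        q[a] = reward(z, a) + gamma * backup
    return q


# Toy stand-ins for the learned components (hypothetical, for illustration only).
ACTIONS = (-1, 1)


def transition(z, a):
    return z + a


def reward(z, a):
    # Reward of 1 for the step that lands on abstract state 3.
    return 1.0 if z + a == 3 else 0.0


def value(z):
    # Uninformative leaf value, so any signal must come from the tree itself.
    return 0.0


q_shallow = tree_q_values(0, ACTIONS, transition, reward, value, depth=1, gamma=0.5)
q_deep = tree_q_values(0, ACTIONS, transition, reward, value, depth=3, gamma=0.5)
print(q_shallow, q_deep)
```

With a depth of 1 the tree cannot see the reward three steps away, so both actions look equally poor. With a depth of 3 the reward is backed up to the root and moving right is preferred. In the setting the abstract describes, the stand-in functions would be differentiable modules trained end-to-end through this computation.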

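This work introduces TreeQN, a new tree-structured model based on online planning, and shows through experiments in several game environments that TreeQN and ATreeC achieve strong performance.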