System identification, also known as learning forward models, transfer
functions, system dynamics, etc., has a long tradition in both science and
engineering across many fields. In particular, it is a recurring theme in
Reinforcement Learning research, where forward models approximate the state
transition function of a Markov Decision Process by learning a mapping function
from the current state and action to the next state. This problem is commonly
framed directly as a Supervised Learning problem. This common approach
faces several difficulties due to the inherent complexities of the dynamics to
be learned, for example, delayed effects, high non-linearity, non-stationarity,
partial observability and, more importantly, error accumulation when using
bootstrapped predictions (predictions based on past predictions) over long
time horizons. Here we explore the use of Reinforcement Learning in this
problem. We elaborate on why and how this problem fits naturally and soundly
within the Reinforcement Learning framework, and present some experimental
results that demonstrate RL is a promising technique for solving this kind of
problem.
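
The supervised formulation and the bootstrapping issue described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar linear system `x' = a*x + b*u + noise`, the least-squares one-step model, and the helper names `simulate`, `fit_one_step_model` and `bootstrapped_rollout`. It fits a one-step forward model from (state, action, next state) samples and then rolls the model forward on its own predictions, so that one-step error and multi-step rollout error can be compared.

```python
import numpy as np

def simulate(a_true, b_true, x0, actions, noise_std, rng):
    """Roll out a hypothetical true system x' = a*x + b*u + noise."""
    xs = [x0]
    for u in actions:
        xs.append(a_true * xs[-1] + b_true * u + noise_std * rng.normal())
    return np.array(xs)

def fit_one_step_model(states, actions):
    """Supervised fit: regress next_state on (state, action) by least squares."""
    X = np.column_stack([states[:-1], actions])
    y = states[1:]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta  # [a_hat, b_hat]

def bootstrapped_rollout(theta, x0, actions):
    """Predict a whole trajectory, feeding each prediction back as the next input."""
    xs = [x0]
    for u in actions:
        xs.append(theta[0] * xs[-1] + theta[1] * u)
    return np.array(xs)

rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=200)
states = simulate(0.95, 0.5, 0.0, actions, 0.05, rng)

theta = fit_one_step_model(states, actions)

# One-step error: the model always sees the true current state.
one_step_pred = theta[0] * states[:-1] + theta[1] * actions
one_step_err = np.abs(one_step_pred - states[1:])

# Rollout error: the model only sees its own previous predictions.
rollout = bootstrapped_rollout(theta, states[0], actions)
rollout_err = np.abs(rollout - states)

print("estimated (a, b):", theta)
print("mean one-step error:", one_step_err.mean())
print("mean rollout error: ", rollout_err.mean())
```

Even in this benign linear case the two error measures are different quantities, since the rollout never gets corrected by observed states; the harder settings listed in the abstract (non-linearity, partial observability, non-stationarity) are not modelled here.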

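This paper examines how, within Reinforcement Learning, learning forward models (also called transfer functions or system dynamics) approximates the state transition function of a Markov Decision Process, and presents experimental results on using Reinforcement Learning techniques to learn the dynamics of complex systems.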

Reinforcement Learning in System Identification

Action planning using learned and differentiable forward models of the world
is a general approach which has a number of desirable properties, including
improved sample complexity over model-free RL methods, reuse of learned models
across different tasks, and the ability to perform efficient gradient-based
optimization in continuous action spaces. However, this approach does not apply
straightforwardly when the action space is discrete. In this work, we show that
it is in fact possible to effectively perform planning via backprop in discrete
action spaces, using a simple parameterization of the action vectors on the
simplex combined with input noise when training the forward model. Our
experiments show that this approach can match or outperform model-free RL and
discrete planning methods on gridworld navigation tasks in terms of performance
and/or planning time while using limited environment interactions, and can
additionally be used to perform model-based control in a challenging new task
where the action space combines discrete and continuous actions. We furthermore
propose a policy distillation approach that yields a fast policy network for
use at inference time, removing the need for an iterative planning
procedure.
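
The idea of planning by gradient descent over actions relaxed onto the simplex can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the forward model is a hand-written, hypothetical linear gridworld model (`MOVES`) rather than a learned network, no input noise or model training is involved, gradients are derived by hand instead of using an autodiff library, and the names `softmax` and `plan_via_backprop` are invented here. Each time step holds a vector of logits; its softmax is a point on the simplex that the model accepts as a "soft" action, and the final plan is read off with an argmax.

```python
import numpy as np

# Hypothetical forward model: each discrete action is a fixed displacement.
# Rows: 0 = up, 1 = down, 2 = left, 3 = right.
MOVES = np.array([[0.0, 1.0], [0.0, -1.0], [-1.0, 0.0], [1.0, 0.0]])

def softmax(z):
    """Row-wise softmax; maps each row of logits to a point on the simplex."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def plan_via_backprop(s0, goal, horizon, lr=0.5, steps=200):
    """Gradient descent on per-step action logits to minimize ||s_T - goal||^2."""
    logits = np.zeros((horizon, len(MOVES)))
    losses = []
    for _ in range(steps):
        a = softmax(logits)                   # (T, K) soft actions
        s_final = s0 + a.sum(axis=0) @ MOVES  # model rollout with soft actions
        diff = s_final - goal
        losses.append(float(diff @ diff))
        g = MOVES @ (2.0 * diff)              # dLoss/da_t, identical for every t
        # Chain rule through the softmax: dLoss/dz = a * (g - <a, g>).
        grad_logits = a * (g[None, :] - (a * g).sum(axis=1, keepdims=True))
        logits -= lr * grad_logits
    plan = softmax(logits).argmax(axis=1)     # discretize the relaxed plan
    return plan, losses

s0 = np.array([0.0, 0.0])
goal = np.array([5.0, 0.0])
plan, losses = plan_via_backprop(s0, goal, horizon=5)

# Execute the discrete plan in the same toy model.
final_state = s0 + MOVES[plan].sum(axis=0)
print("plan (action indices):", plan.tolist())
print("final state:", final_state)
```

The goal in this sketch is chosen so that the only relaxed solution is a one-hot action at every step. In general the relaxed optimum can lie in the interior of the simplex, in which case the argmax step may yield a discrete plan that does not reproduce the relaxed rollout.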

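This paper presents an action-planning method based on forward models that performs planning via backpropagation in discrete action spaces, using parameterized action vectors together with input noise, along with a policy distillation method. It matches or outperforms model-free RL and discrete planning methods, and can be applied to model-based control tasks that combine discrete and continuous action spaces.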