Existing methods for optimal control struggle to deal with the complexity
commonly encountered in real-world systems, including dimensionality, process
error, model bias and data heterogeneity. Instead of tackling these system
complexities directly, researchers have typically sought to simplify models to
fit optimal control methods. But when is the optimal solution to an
approximate, stylized model better than an approximate solution to a more
accurate model? While this question has largely gone unanswered owing to the
difficulty of finding even approximate solutions for complex models, recent
algorithmic and computational advances in deep reinforcement learning (DRL)
might finally allow us to address it. DRL methods have to date
been applied primarily in the context of games or robotic mechanics, which
operate under precisely known rules. Here, we demonstrate that DRL
algorithms using deep neural networks can successfully approximate solutions
(the "policy function" or control rule) to a non-linear three-variable model
of a fishery without knowing or ever attempting to infer a model of the
process itself. We find that the reinforcement learning agent discovers an
effective simplification of the problem to obtain an interpretable control
rule. We show that the policy obtained with DRL is both more profitable and
more sustainable than any constant mortality policy -- the standard family of
policies considered in fishery management.
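
As a point of reference for the baseline named above, a constant mortality policy removes the same fraction of the stock at every step. The sketch below illustrates that idea on a deliberately simple, hypothetical single-species logistic model. It is not the three-variable model of the abstract, and the function names and the parameters `r`, `K`, `x0` and `steps` are assumptions chosen purely for illustration.

```python
def simulate_constant_mortality(mortality, r=0.5, K=1.0, x0=0.5, steps=200):
    """Harvest a fixed fraction of the stock each step, then let it regrow.

    Returns (total_harvest, final_stock). The logistic model and all default
    parameters are illustrative assumptions, not taken from the abstract.
    """
    if not 0.0 <= mortality <= 1.0:
        raise ValueError("mortality must lie in [0, 1]")
    stock, total_harvest = x0, 0.0
    for _ in range(steps):
        harvest = mortality * stock              # same harvest fraction every step
        stock -= harvest
        stock += r * stock * (1.0 - stock / K)   # logistic regrowth of what escaped
        total_harvest += harvest
    return total_harvest, stock


def best_constant_mortality(grid_size=101, **kwargs):
    """Grid-search the mortality rate that maximises cumulative harvest."""
    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    return max(candidates,
               key=lambda m: simulate_constant_mortality(m, **kwargs)[0])
```

A learned policy would instead choose the harvest as a function of the observed state. A grid search such as `best_constant_mortality` gives one simple way to obtain a fixed-rate baseline to compare it against.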

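Existing optimal control methods struggle with the complexity commonly encountered in real-world systems, including dimensionality, process error, model bias and data heterogeneity. Researchers have typically simplified models to fit optimal control methods, but the question of when the optimal solution to an approximate, simplified model beats an approximate solution to a more accurate model has not been adequately answered. Algorithmic and computational advances in deep reinforcement learning (DRL) offer a way to address it. DRL methods have so far been applied mainly to games or robotic mechanics, which operate under precisely known rules. Without knowing or attempting to infer a model of the process, we demonstrate that DRL algorithms using deep neural networks can successfully approximate the solution (the "policy function" or control rule) of a non-linear three-variable fishery model. We find that the reinforcement learning agent obtains an interpretable control rule by simplifying the problem. We show that the policy obtained with DRL is both more profitable and more sustainable than any constant mortality policy -- the standard family of policies in fishery management.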

Pretty darn good control: when are approximate solutions better than approximate models

Reinforcement learning (RL) algorithms allow artificial agents to improve
their selection of actions to increase rewarding experiences in their
environments. Temporal Difference (TD) Learning -- a model-free RL method -- is
a leading account of the role of the midbrain dopamine system and the basal
ganglia in reinforcement learning. These algorithms typically learn a mapping from the
agent's current sensed state to a selected action (known as a policy function)
via learning a value function (expected future rewards). TD Learning methods
have been very successful on a broad range of control tasks, but learning can
become intractably slow as the state space of the environment grows. This has
motivated methods that learn internal representations of the agent's state,
effectively reducing the size of the state space and restructuring state
representations in order to support generalization. However, TD Learning
coupled with an artificial neural network, as a function approximator, has been
shown to fail to learn some fairly simple control tasks, challenging this
explanation of reward-based learning. We hypothesize that such failures do not
arise in the brain because of the ubiquitous presence of lateral inhibition in
the cortex, producing sparse distributed internal representations that support
the learning of expected future reward. The sparse conjunctive representations
can avoid catastrophic interference while still supporting generalization. We
provide support for this conjecture through computational simulations,
demonstrating the benefits of learned sparse representations for three
problematic classic control tasks: Puddle-world, Mountain-car, and Acrobot.
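
To make the two ingredients above concrete, the following is a minimal sketch of a TD(0) value update applied to a sparse feature vector. The abstract does not say how sparsity is produced. The sketch assumes a k-winners-take-all rule as a crude stand-in for lateral inhibition, and a linear value function instead of a neural network. The names `k_winners_take_all` and `td0_update` and the values of `alpha`, `gamma` and `k` are illustrative assumptions.

```python
def k_winners_take_all(activations, k):
    """Keep the k largest activations and set the rest to zero."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    winners = set(order[:max(k, 0)])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


def td0_update(weights, features, reward, next_features,
               alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) step for a linear value function V(s) = w . phi(s)."""
    value = sum(w * f for w, f in zip(weights, features))
    next_value = 0.0 if terminal else sum(
        w * f for w, f in zip(weights, next_features))
    td_error = reward + gamma * next_value - value
    # Only weights attached to active (non-zero) features are changed.
    new_weights = [w + alpha * td_error * f
                   for w, f in zip(weights, features)]
    return new_weights, td_error


# Usage: sparsify a dense activation pattern, then apply one update.
phi = k_winners_take_all([0.2, 0.9, 0.1, 0.5], k=2)
weights, delta = td0_update([0.0] * 4, phi, reward=1.0,
                            next_features=phi, terminal=True)
```

Because the update is proportional to the feature values, weights attached to units silenced by the sparsification step are left untouched. This is one way to picture how sparse codes could limit interference between states.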

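This paper examines TD Learning among reinforcement learning algorithms and the role of the basal ganglia in reinforcement learning, and uses computational simulations to verify the benefits of sparse conjunctive representations for learning expected reward in specific environments.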

Learning sparse representations in reinforcement learning

One way to approach end-to-end autonomous driving is to learn a policy
function that maps from a sensory input, such as an image frame from a
front-facing camera, to a driving action, by imitating an expert driver, or a
reference policy. This can be done by supervised learning, where a policy
function is tuned to minimize the difference between the predicted and
ground-truth actions. A policy function trained in this way, however, is known to
suffer from unexpected behaviours due to the mismatch between the states
reachable by the reference policy and by the trained policy function. More advanced
algorithms for imitation learning, such as DAgger, address this issue by
iteratively collecting training examples from both reference and trained
policies. These algorithms often require a large number of queries to a
reference policy, which is undesirable as the reference policy is often
expensive. In this paper, we propose an extension of DAgger, called
SafeDAgger, that is query-efficient and more suitable for end-to-end autonomous
driving. We evaluate the proposed SafeDAgger in a car racing simulator and show
that it indeed requires fewer queries to a reference policy. We observe a
significant speed-up in convergence, which we conjecture to be due to the
effect of automated curriculum learning.
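
The iterative data collection described above can be sketched as a plain DAgger-style loop on a toy problem. This is not SafeDAgger and not a driving simulator. The one-dimensional dynamics, the hand-written `reference_policy`, the nearest-neighbour learner and the `beta_decay` schedule are all hypothetical choices made only to keep the example self-contained.

```python
import random


def reference_policy(state):
    """Hypothetical expert: push the 1-D state back toward zero."""
    return -1 if state > 0 else 1


def step(state, action):
    """Toy dynamics: the action nudges the state by a fixed amount."""
    return state + 0.1 * action


def fit_nearest_neighbour(dataset):
    """Supervised learner: copy the label of the closest aggregated state."""
    data = list(dataset)

    def policy(state):
        _, action = min(data, key=lambda pair: abs(pair[0] - state))
        return action

    return policy


def dagger(n_iterations=5, horizon=30, beta_decay=0.5, seed=0):
    """Run a DAgger-style loop; return (trained_policy, reference_queries)."""
    rng = random.Random(seed)
    dataset, policy, queries = [], None, 0
    for iteration in range(n_iterations):
        beta = beta_decay ** iteration   # chance of executing the expert action
        state = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            expert_action = reference_policy(state)  # label every visited state
            queries += 1
            dataset.append((state, expert_action))
            if policy is None or rng.random() < beta:
                action = expert_action
            else:
                action = policy(state)               # visit the learner's states
            state = step(state, action)
        policy = fit_nearest_neighbour(dataset)      # retrain on aggregated data
    return policy, queries
```

In this sketch the reference policy is queried at every visited state, so the query count grows as `n_iterations * horizon`. That is the cost a query-efficient variant would aim to reduce.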

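This paper introduces SafeDAgger, a DAgger-based learning method for autonomous driving that effectively reduces the number of queries to the reference policy and speeds up convergence.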