While contemporary reinforcement learning research and applications have embraced policy gradient methods as a panacea for learning problems, value-based methods can still be useful in many domains, provided they can be exploited in a sample-efficient way. In this paper, we explore the chaotic nature of deep Q-networks (DQNs) in reinforcement learning and examine how the information they retain after training can be repurposed to adapt a model to different tasks. We start by designing a simple experiment in which we can observe the Q-values for each state and action in an environment. We then train in eight different ways to explore how these training algorithms affect whether accurate Q-values are learned. We test the adaptability of each trained model when it is retrained to accomplish a slightly modified task. We then scale our setup to the larger problem of an autonomous vehicle at an unprotected intersection. We observe that a model adapts to new tasks more quickly when the base model's Q-value estimates are closer to the true Q-values. The results provide insights and guidelines on which algorithms are useful for sample-efficient task adaptation.
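To make the comparison between estimated and true Q-values concrete, the following is a minimal sketch and not the paper's actual setup. It assumes a hypothetical five-state chain environment with deterministic transitions, computes the true Q-values by Q-value iteration, and scores an estimate by its mean absolute error. The names `step`, `true_q_values`, and `q_error`, the discount factor, and the noise used to fake an estimate are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy environment (an assumption, not the paper's experiment):
# a chain of 5 states where state 4 is the terminal goal.
N_STATES = 5
ACTIONS = (-1, +1)   # action 0 moves left, action 1 moves right
GAMMA = 0.9          # assumed discount factor


def step(state, action):
    """Deterministic transition: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def true_q_values(tol=1e-12):
    """Q-value iteration: apply the Bellman optimality backup until stable."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    while True:
        new_q = np.zeros_like(q)
        for s in range(N_STATES - 1):      # the terminal state keeps Q = 0
            for a in range(len(ACTIONS)):
                s2, r, done = step(s, a)
                bootstrap = 0.0 if done else GAMMA * q[s2].max()
                new_q[s, a] = r + bootstrap
        if np.abs(new_q - q).max() < tol:
            return new_q
        q = new_q


def q_error(q_est, q_true):
    """Mean absolute error between estimated and true Q-values."""
    return float(np.abs(q_est - q_true).mean())


q_true = true_q_values()

# Stand-in for a trained network's estimates: true values plus noise.
rng = np.random.default_rng(0)
q_est = q_true + rng.normal(scale=0.1, size=q_true.shape)

print(q_true)
print("estimate error:", q_error(q_est, q_true))
```

In an environment this small every state-action pair can be enumerated, so the error of any learned estimate can be measured exactly; in larger problems the true Q-values would typically have to be approximated, for example from rollouts.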

Adapting Reinforcement Learning Agents to New Tasks: Insights from Q-Values