We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
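As a rough illustration of what regularizing an RNN-based policy with informed policies could look like, the sketch below combines a REINFORCE-style surrogate loss with a penalty that pulls the RNN policy's action distribution toward that of a task-informed policy. The abstract does not specify the form of the regularizer, so the KL penalty, the function names `kl_divergence` and `regularized_loss`, and the weight `beta` are all assumptions made for this example and should not be read as the method proposed in the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(rnn_probs, informed_probs, actions, returns, beta=1.0):
    """Hypothetical per-trajectory loss (an assumption, not the paper's objective).

    rnn_probs[t]      : action distribution of the RNN policy at step t
                        (no access to the task description)
    informed_probs[t] : action distribution of the informed policy at step t
                        (trained with the task description)
    actions[t]        : index of the action taken at step t
    returns[t]        : return-to-go observed from step t
    beta              : weight of the regularization term
    """
    T = len(actions)
    # REINFORCE surrogate: minimizing it raises the log-probability of
    # actions that were followed by high returns.
    pg = -sum(math.log(rnn_probs[t][actions[t]]) * returns[t] for t in range(T)) / T
    # Assumed regularizer: keep the RNN policy close to the informed policy.
    reg = sum(kl_divergence(informed_probs[t], rnn_probs[t]) for t in range(T)) / T
    return pg + beta * reg, pg, reg

# Usage: a one-step trajectory where the informed policy prefers action 0.
total, pg, reg = regularized_loss([[0.5, 0.5]], [[0.9, 0.1]], [0], [1.0], beta=1.0)
print(total, pg, reg)
```

Under this assumed form, the penalty is only computable at training time, when the task description (and hence the informed policy) is available. At test time the RNN policy acts alone.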

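This paper studies the problem of learning exploration-exploitation strategies that adapt to dynamic environments. It proposes a new algorithm that regularizes the training of an RNN-based policy using informed policies, which significantly reduces the sample complexity of training. The method learns exploration strategies that efficiently balance gathering information about the unknown and changing task against maximizing the reward over time, and it is tested in a variety of environments.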

Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization