Discovering achievements with a hierarchical structure in procedurally
generated environments poses a significant challenge. This requires agents to
possess a broad range of abilities, including generalization and long-term
reasoning. Many prior methods are built upon model-based or hierarchical
approaches, with the belief that an explicit module for long-term planning
would be beneficial for learning hierarchical achievements. However, these
methods require an excessive amount of environment interactions or large model
sizes, limiting their practicality. In this work, we find that proximal
policy optimization (PPO), a simple and versatile model-free algorithm,
outperforms the prior methods when combined with recent implementation
practices. Moreover,
we find that the PPO agent can predict the next achievement to be unlocked to
some extent, though with low confidence. Based on this observation, we propose
a novel contrastive learning method, called achievement distillation, that
strengthens the agent's capability to predict the next achievement. Our method
exhibits a strong capacity for discovering hierarchical achievements and shows
state-of-the-art performance on the challenging Crafter environment using fewer
model parameters in a sample-efficient regime.

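In this work, we find that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms prior methods when combined with recent implementation practices. We also find that the PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence. Based on this finding, we propose a novel contrastive learning method, called achievement distillation, that strengthens the agent's ability to predict the next achievement. Our method shows a strong capacity for discovering hierarchical achievements in the challenging Crafter environment and reaches state-of-the-art performance with fewer model parameters in a sample-efficient regime.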

Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning
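
The abstract above does not spell out the achievement distillation objective, so the following is only a minimal sketch of a generic contrastive (InfoNCE-style) loss, not the authors' method. It assumes an anchor embedding (for example, a state representation), one positive embedding (for example, the achievement that is actually unlocked next), and a set of negative embeddings (other achievements). The function names, the cosine similarity, and the temperature value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not the paper's objective).

    Returns -log( exp(s_pos / t) / sum_i exp(s_i / t) ), where the sum runs
    over the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings: the loss is small when the anchor matches the positive
# and large when it matches a negative instead.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing a loss of this shape pulls the anchor toward its positive and away from the negatives, which is one plausible way to raise the confidence of a next-achievement prediction.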

In this paper, we investigate the problem of overfitting in deep
reinforcement learning. In the most common RL benchmarks, it is customary
to use the same environments for both training and testing. This practice
offers relatively little insight into an agent's ability to generalize. We
address this issue by using procedurally generated environments to construct
distinct training and test sets. Most notably, we introduce a new environment
called CoinRun, designed as a benchmark for generalization in RL. Using
CoinRun, we find that agents overfit to surprisingly large training sets. We
then show that deeper convolutional architectures improve generalization, as do
methods traditionally found in supervised learning, including L2
regularization, dropout, data augmentation and batch normalization.

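This paper studies overfitting in deep reinforcement learning and uses procedurally generated environments to construct distinct training and test sets, introducing a new environment called CoinRun as a benchmark for generalization in RL. Using CoinRun, the authors find that agents overfit to surprisingly large training sets. They also show that deeper convolutional architectures, as well as methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization, improve generalization.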
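
The train/test protocol described above can be sketched as follows. This is a minimal illustration, assuming that each procedurally generated level is identified by an integer seed; the helper names `split_level_seeds` and `generalization_gap` are hypothetical and are not part of CoinRun's API.

```python
import random


def split_level_seeds(num_train, num_test, seed=0):
    """Draw disjoint sets of level seeds for training and testing.

    Sampling without replacement guarantees that no test level
    was seen during training.
    """
    rng = random.Random(seed)
    seeds = rng.sample(range(2**31), num_train + num_test)
    return seeds[:num_train], seeds[num_train:]


def generalization_gap(train_returns, test_returns):
    """Mean training return minus mean test return.

    A large positive gap suggests the agent has overfit to the training levels.
    """
    mean_train = sum(train_returns) / len(train_returns)
    mean_test = sum(test_returns) / len(test_returns)
    return mean_train - mean_test


# Example: 500 training levels and 100 held-out test levels.
train_seeds, test_seeds = split_level_seeds(num_train=500, num_test=100)

# Placeholder returns for illustration only, not measured results.
gap = generalization_gap([1.0, 1.0], [0.5, 0.5])
```

Evaluating the same agent on both seed sets and tracking the gap as the number of training levels grows is one way to see how large the training set must be before overfitting subsides.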