A key challenge in training generally-capable agents is the design of
training tasks that facilitate broad generalization and robustness to
environment variations. This challenge motivates the problem setting of
Unsupervised Environment Design (UED), whereby a student agent trains on an
adaptive distribution of tasks proposed by a teacher agent. A pioneering
approach for UED is PAIRED, which uses reinforcement learning (RL) to train a
teacher policy to design tasks from scratch, making it possible to directly
generate tasks that are adapted to the agent's current capabilities. Despite
its strong theoretical backing, PAIRED suffers from a variety of challenges
that hinder its practical performance. Thus, state-of-the-art methods currently
rely on curation and mutation rather than generation of new tasks. In this
work, we investigate several key shortcomings of PAIRED and propose a solution
for each. As a result, we make it possible for PAIRED to match or
exceed state-of-the-art methods, producing robust agents in several established
challenging procedurally-generated environments, including a partially-observed
maze navigation task and a continuous-control car racing environment. We
believe this work motivates a renewed emphasis on UED methods based on learned
models that directly generate challenging environments, potentially unlocking
more open-ended RL training and, as a result, more general agents.
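
The teacher's training signal in PAIRED is not spelled out above. As a minimal sketch, assuming the commonly described regret formulation (the best return of a second "antagonist" student minus the mean return of the main "protagonist" student on the same generated task), the teacher reward could be estimated as follows. The function name and the example returns are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def estimate_regret(protagonist_returns, antagonist_returns):
    """Estimate regret on one generated task (assumed PAIRED-style signal).

    Regret is approximated as the best antagonist episode return minus the
    mean protagonist episode return. A teacher rewarded with this quantity
    is pushed toward tasks that are solvable (the antagonist does well)
    but not yet mastered by the protagonist.
    """
    protagonist_returns = np.asarray(protagonist_returns, dtype=float)
    antagonist_returns = np.asarray(antagonist_returns, dtype=float)
    return float(antagonist_returns.max() - protagonist_returns.mean())

# Hypothetical episode returns collected on a single teacher-designed task.
teacher_reward = estimate_regret([0.2, 0.4], [0.5, 0.9])
```

A task that neither student can solve yields regret near zero, as does a task both students solve easily, which is why this signal favors tasks at the edge of the student's current capabilities.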

Training tasks, Unsupervised Environment Design, PAIRED, state-of-the-art methods, open-ended reinforcement learning training.

Stabilizing Unsupervised Environment Design with a Learned Adversary

In meta reinforcement learning (meta RL), an agent learns from a set of
training tasks how to quickly solve a new task, drawn from the same task
distribution. The optimal meta RL policy, a.k.a. the Bayes-optimal behavior, is
well defined, and guarantees optimal reward in expectation, taken with respect
to the task distribution. The question we explore in this work is how many
training tasks are required to guarantee approximately optimal behavior with
high probability. Recent work provided the first such PAC analysis for a
model-free setting, where a history-dependent policy was learned from the
training tasks. In this work, we propose a different approach: directly learn
the task distribution, using density estimation techniques, and then train a
policy on the learned task distribution. We show that our approach leads to
bounds that depend on the dimension of the task distribution. In particular, in
settings where the task distribution lies on a low-dimensional manifold, we
extend our analysis to use dimensionality reduction techniques and account for
such structure, obtaining significantly better bounds than previous work, whose
bounds depend strictly on the number of states and actions. The key to our
approach is the regularization implied by the kernel density estimation method.
We further demonstrate that this regularization is useful in practice when
plugged into the state-of-the-art VariBAD meta RL algorithm.
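
The two-step recipe described above (estimate the task distribution with kernel density estimation, then train on tasks drawn from the estimate) can be sketched in a few lines. This is a minimal illustration, assuming each task is represented by a fixed-length parameter vector and that a Gaussian kernel with a hand-picked bandwidth is used. The function names, the bandwidth value, and the example tasks are hypothetical and are not taken from the paper or from VariBAD.

```python
import numpy as np

def kde_density(query, train_tasks, bandwidth):
    """Gaussian KDE density of `query` given the training task vectors."""
    train_tasks = np.asarray(train_tasks, dtype=float)
    query = np.asarray(query, dtype=float)
    dim = train_tasks.shape[1]
    sq_dist = np.sum((train_tasks - query) ** 2, axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    kernels = np.exp(-sq_dist / (2.0 * bandwidth ** 2)) / norm
    return float(kernels.mean())

def sample_from_kde(train_tasks, bandwidth, n_samples, rng):
    """Draw new task vectors from the Gaussian KDE.

    Sampling from a Gaussian KDE amounts to picking a training task
    uniformly at random and adding isotropic Gaussian noise whose
    standard deviation equals the bandwidth.
    """
    train_tasks = np.asarray(train_tasks, dtype=float)
    idx = rng.integers(0, len(train_tasks), size=n_samples)
    noise = rng.normal(scale=bandwidth, size=(n_samples, train_tasks.shape[1]))
    return train_tasks[idx] + noise

# Hypothetical 2-D task parameters (e.g. goal positions) from four training tasks.
rng = np.random.default_rng(0)
train_tasks = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
new_tasks = sample_from_kde(train_tasks, bandwidth=0.05, n_samples=5, rng=rng)
# A policy would then be trained on `new_tasks` instead of only on `train_tasks`.
```

The bandwidth plays the role of the regularizer: a larger value smooths the estimated distribution and spreads sampled tasks further from the finite training set, while a value near zero reduces to training on the original tasks only.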

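This work studies how to make meta reinforcement learning effective by using density estimation techniques to learn the task distribution directly and then training a policy on the learned distribution to maximize return. The results show that this approach compares favorably with methods that learn history-dependent policies, particularly when the task distribution lies on a low-dimensional manifold.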

Meta Reinforcement Learning with Finite Training Tasks -- a Density Estimation Approach

Large-scale models for learning fixed-dimensional cross-lingual sentence
representations like LASER (Artetxe and Schwenk, 2019b) lead to significant
improvement in performance on downstream tasks. However, scaling up or
modifying such large-scale models further is usually impractical due to
memory limitations. In this work, we introduce a lightweight dual-transformer
architecture with just 2 layers for generating memory-efficient cross-lingual
sentence representations. We explore different training tasks and observe that
current cross-lingual training tasks leave a lot to be desired for this shallow
architecture. To ameliorate this, we propose a novel cross-lingual language
model, which combines the existing single-word masked language model with the
newly proposed cross-lingual token-level reconstruction task. We further
augment training with two computationally light sentence-level contrastive
learning tasks that improve the alignment of the cross-lingual sentence
representation space, which compensates for the learning bottleneck of the
lightweight transformer on generative tasks. Our comparisons
with competing models on cross-lingual sentence retrieval and multilingual
document classification confirm the effectiveness of the newly proposed
training tasks for a shallow model.
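
A sentence-level contrastive task of the general kind mentioned above can be illustrated with an in-batch objective, where each source sentence should score higher with its own translation than with the other translations in the batch. The abstract does not specify the form of the two proposed tasks, so the InfoNCE-style loss below, the `temperature` value, and the toy embeddings are assumptions made purely for illustration.

```python
import numpy as np

def in_batch_contrastive_loss(src_emb, tgt_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of aligned sentence embeddings.

    Row i of `src_emb` and row i of `tgt_emb` are assumed to encode a
    translation pair; all other rows in the batch act as negatives.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature                   # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy embeddings: perfectly aligned pairs versus deliberately misaligned pairs.
src = np.eye(3)
aligned_loss = in_batch_contrastive_loss(src, np.eye(3))
misaligned_loss = in_batch_contrastive_loss(src, np.roll(np.eye(3), 1, axis=0))
```

Because the loss only needs one similarity matrix per batch, it adds little computation on top of the encoder forward pass, which is consistent with the motivation of keeping the extra tasks light for a shallow model.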

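This paper introduces a lightweight dual-transformer architecture for generating memory-efficient cross-lingual sentence representations. It also proposes a new cross-lingual language model and introduces two computationally light sentence-level contrastive learning tasks that improve the alignment of the cross-lingual sentence representation space, compensating for the learning bottleneck of the lightweight transformer on generative tasks. Comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the proposed training tasks.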