Goal-conditioned Reinforcement Learning (RL) aims at learning optimal
policies, given goals encoded in special command inputs. Here we study
goal-conditioned neural nets (NNs) that learn to generate deep NN policies in
the form of context-specific weight matrices, similar to Fast Weight Programmers
and other methods from the 1990s. Using context commands of the form "generate
a policy that achieves a desired expected return," our NN generators combine
powerful exploration of parameter space with generalization across commands to
iteratively find better and better policies. A form of weight-sharing
HyperNetworks and policy embeddings scales our method to generate deep NNs.
Experiments show how a single learned policy generator can produce policies
that achieve any return seen during training. Finally, we evaluate our
algorithm on a set of continuous control tasks where it exhibits competitive
performance. Our code is public.
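
The idea of a command-conditioned policy generator can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`init_generator`, `generate_policy`, `act`), the layer sizes, and the plain MLP generator are all assumptions chosen for illustration, and training as well as the weight-sharing HyperNetwork and policy-embedding components are omitted. The sketch only shows the forward pass: a scalar "desired return" command is mapped to the flat weight vector of a small policy network, which is then reshaped into weight matrices and used to act.

```python
import numpy as np

def init_generator(rng, hidden=32, obs_dim=3, act_dim=1, policy_hidden=8):
    """Hypothetical generator: an MLP mapping a return command to policy weights."""
    # Total number of parameters of the generated one-hidden-layer policy.
    n_policy = obs_dim * policy_hidden + policy_hidden + policy_hidden * act_dim + act_dim
    return {
        "W1": rng.normal(0.0, 0.1, (1, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_policy)),
        "b2": np.zeros(n_policy),
        "shapes": (obs_dim, policy_hidden, act_dim),
    }

def generate_policy(gen, desired_return):
    """Map the command 'achieve this return' to context-specific weight matrices."""
    c = np.array([[desired_return]], dtype=float)
    h = np.tanh(c @ gen["W1"] + gen["b1"])
    flat = (h @ gen["W2"] + gen["b2"])[0]
    obs_dim, ph, act_dim = gen["shapes"]
    # Slice the flat output into the policy's weight matrices and biases.
    i = 0
    W1 = flat[i:i + obs_dim * ph].reshape(obs_dim, ph); i += obs_dim * ph
    b1 = flat[i:i + ph]; i += ph
    W2 = flat[i:i + ph * act_dim].reshape(ph, act_dim); i += ph * act_dim
    b2 = flat[i:i + act_dim]
    return W1, b1, W2, b2

def act(policy, obs):
    """Run the generated policy on one observation; actions lie in [-1, 1]."""
    W1, b1, W2, b2 = policy
    return np.tanh(np.tanh(obs @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(0)
gen = init_generator(rng)
policy = generate_policy(gen, desired_return=1.0)
action = act(policy, np.array([0.1, -0.2, 0.3]))
```

In a full method, the generator would be trained so that the policy produced for a given command actually achieves roughly that return, and the command would then be raised over time to search for better policies.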

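Summary: This work studies goal-conditioned reinforcement learning, in which goal-conditioned neural networks use context commands to generate the weight matrices of deep neural network policies; HyperNetworks and policy embeddings scale the method to deep networks. Experiments show that a single learned policy generator can produce policies that achieve any return observed during training, and that the algorithm performs competitively on a set of continuous control tasks.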