We present a neural architecture search (NAS) algorithm that constructs compact
reinforcement learning (RL) policies by combining Efficient Neural Architecture
Search (ENAS) and Evolution Strategies (ES) in a highly scalable and intuitive
way. By defining the combinatorial search space of NAS
to be the set of different edge-partitionings (colorings) into same-weight
classes, we represent compact architectures via efficient learned
edge-partitionings. For several RL tasks, we manage to learn colorings
translating to effective policies parameterized by as few as $17$ weight
parameters, providing >90% compression over vanilla policies and 6x compression
over state-of-the-art compact policies based on Toeplitz matrices, while still
maintaining good reward. We believe that our work is one of the first attempts
to propose a rigorous approach to training structured neural network
architectures for RL problems, which are of particular interest in mobile
robotics, where storage and computational resources are limited.
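
The edge-partitioning idea can be illustrated with a short sketch. This is not the authors' implementation: the function name `chromatic_dense`, the layer sizes, the tanh nonlinearity, and the randomly sampled coloring are assumptions made for the example. In the method described above, the coloring would be learned rather than sampled.

```python
import numpy as np

def chromatic_dense(x, shared_weights, coloring):
    """Apply a dense layer whose weight matrix is built from a few shared values.

    coloring[i, j] is the integer class (color) of edge (i, j); every edge in
    the same class uses the single scalar shared_weights[class].
    """
    weight_matrix = shared_weights[coloring]  # expand classes to per-edge weights
    return np.tanh(x @ weight_matrix)         # tanh is an assumed nonlinearity

rng = np.random.default_rng(0)
in_dim, out_dim, num_classes = 20, 10, 17     # illustrative sizes (assumption)

# Random partition of the 200 edges into 17 classes; a stand-in for a learned one.
coloring = rng.integers(0, num_classes, size=(in_dim, out_dim))
shared_weights = rng.standard_normal(num_classes)

x = rng.standard_normal(in_dim)
y = chromatic_dense(x, shared_weights, coloring)

dense_params = in_dim * out_dim               # free weights in an unstructured layer
compression = 1.0 - num_classes / dense_params
```

Only `shared_weights` holds trainable values here, so the number of parameters equals the number of color classes regardless of how many edges the layer has.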

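This paper proposes a neural architecture search algorithm that combines ENAS and ES to construct compact reinforcement learning policies. For RL problems, including those in mobile robotics, it offers a rigorous approach to training structured neural network architectures, representing compact architectures through efficiently learned edge-partitionings. On several RL tasks, the algorithm provides >90% compression with as few as 17 weight parameters.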

Reinforcement Learning with Chromatic Networks for Compact Architecture Search

We present a new method of blackbox optimization via gradient approximation
with the use of structured random orthogonal matrices, providing more accurate
estimators than baselines and with provable theoretical guarantees. We show
that this algorithm can be successfully applied to learn better quality compact
policies than those using standard gradient estimation techniques. The compact
policies we learn have several advantages over unstructured ones, including
faster training algorithms and faster inference. These benefits are important
when the policy is deployed on real hardware with limited resources. Further,
compact policies provide more scalable architectures for derivative-free
optimization (DFO) in high-dimensional spaces. We show that most robotics tasks
from the OpenAI Gym can be solved using neural networks with fewer than 300
parameters, with almost linear-time inference and up to 13x fewer parameters
than the Evolution Strategies (ES) algorithm
introduced by Salimans et al. (2017). We do not need heuristics such as fitness
shaping to learn good quality policies, resulting in a simple and theoretically
motivated training mechanism.
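
As a rough illustration of gradient approximation with orthogonal perturbations, the sketch below estimates a gradient from antithetic function evaluations. It rests on several assumptions: the directions come from a QR decomposition of a Gaussian matrix, which is a simple stand-in for the structured random orthogonal matrices mentioned above, and the names `orthogonal_directions` and `es_gradient`, the toy objective, and all constants are hypothetical.

```python
import numpy as np

def orthogonal_directions(num_dirs, dim, rng):
    """Return num_dirs mutually orthogonal rows, each with norm sqrt(dim)."""
    assert num_dirs <= dim
    gaussian = rng.standard_normal((dim, dim))
    q, _ = np.linalg.qr(gaussian)              # columns of q are orthonormal
    return np.sqrt(dim) * q[:, :num_dirs].T    # rescale to the typical Gaussian norm

def es_gradient(f, theta, sigma, num_dirs, rng):
    """Antithetic blackbox estimate of the gradient of f at theta."""
    eps = orthogonal_directions(num_dirs, theta.size, rng)
    diffs = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
    return (diffs[:, None] * eps).sum(axis=0) / (2.0 * sigma * num_dirs)

# Toy objective (assumption): reward peaks when theta equals target.
target = np.arange(5, dtype=float)

def reward(theta):
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)
grad = es_gradient(reward, theta, sigma=0.1, num_dirs=5, rng=rng)
```

On this quadratic objective, using as many orthogonal directions as there are dimensions makes the estimate match the true gradient `-2 * (theta - target)` up to rounding, which gives a convenient sanity check; general objectives will only be approximated.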

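Blackbox optimization via gradient approximation with structured random orthogonal matrices can learn better compact policies than standard gradient estimation techniques, improving speed and scalability on real hardware with limited resources.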