We develop a method for policy architecture search and adaptation via
gradient-free optimization that can learn to perform autonomous driving tasks.
By learning from both demonstration and environmental reward, we obtain a model
that can learn with relatively few early catastrophic failures. We first learn
an architecture of appropriate complexity to perceive aspects of world state
relevant to the expert demonstration, and then mitigate the effect of
domain shift during deployment by adapting a policy demonstrated in a source
domain to rewards obtained in a target environment. We show that our approach
allows safer learning than baseline methods, offering a reduced cumulative
crash metric over the agent's lifetime as it learns to drive in a realistic
simulated environment.
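
As a rough illustration of the adaptation step only, the sketch below uses a simple (1+1) hill-climbing search, one common form of gradient-free optimization, to adjust a parameter vector toward higher reward while counting low-reward evaluations as "crashes". It is not the method of this paper: the reward function, the crash threshold, the parameter vector, and all names (`adapt_policy`, `toy_target_reward`, `crash_threshold`) are hypothetical stand-ins chosen for the example, and architecture search is not shown.

```python
import random


def adapt_policy(source_params, reward_fn, steps=200, sigma=0.1,
                 crash_threshold=-2.0, seed=0):
    """Gradient-free (1+1) hill climbing starting from source-domain parameters.

    Each step perturbs the current best parameters with Gaussian noise and
    keeps the candidate only if its reward is at least as high. Evaluations
    whose reward falls below `crash_threshold` (a hypothetical proxy) are
    counted as crashes.
    """
    rng = random.Random(seed)
    best = list(source_params)
    best_reward = reward_fn(best)
    crashes = 0
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in best]
        reward = reward_fn(candidate)
        if reward < crash_threshold:
            crashes += 1
        if reward >= best_reward:
            best, best_reward = candidate, reward
    return best, best_reward, crashes


def toy_target_reward(params):
    """Hypothetical target-environment reward, maximal at (1.0, -0.5)."""
    target = (1.0, -0.5)
    return -sum((p - t) ** 2 for p, t in zip(params, target))


# Stand-in for a policy learned from demonstration in the source domain.
source_policy = [0.0, 0.0]
adapted, adapted_reward, cumulative_crashes = adapt_policy(
    source_policy, toy_target_reward
)
print(adapted_reward >= toy_target_reward(source_policy), cumulative_crashes)
```

Because a candidate is accepted only when it does not lower the reward, the adapted parameters never score worse than the starting point on this toy objective. Starting from demonstrated parameters rather than random ones is what keeps early evaluations away from the crash threshold in this sketch.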