Policy optimization in unknown, sparse-reward environments with expensive and
limited interactions is challenging, and poses a need for effective
exploration. Motivated by complex navigation tasks that require real-world
training (when cheap simulators are not available), we consider an agent that
faces an unknown distribution of environments and must use a series of
training environments to choose an exploration strategy that benefits policy
learning in a test environment drawn from the same distribution. Most
existing approaches focus on fixed exploration strategies, while the few that
view exploration as a meta-optimization problem tend to ignore the need for
cost-efficient exploration. We propose a cost-aware Bayesian optimization
approach that efficiently searches over a class of dynamic subgoal-based
exploration strategies. The algorithm adjusts a variety of levers -- the
locations of the subgoals, the length of each episode, and the number of
replications per trial -- in order to overcome the challenges of sparse
rewards, expensive interactions, and noise. Our experimental evaluation
demonstrates that, when averaged across problem domains, the proposed algorithm
outperforms the meta-learning algorithm MAML by 19%, the hyperparameter tuning
method Hyperband by 23%, and the BO techniques EI and LCB by 24% and 22%,
respectively.
We also provide a theoretical foundation and prove that the method
asymptotically identifies a near-optimal subgoal design from the search space.
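
To make the idea of cost-aware search concrete, the sketch below ranks candidate
trials by expected improvement per unit cost, a standard cost-aware heuristic in
Bayesian optimization. It is an illustration only and is not the acquisition
function proposed in this work. The cost model (episode length times number of
replications), the candidate dictionaries, and all function names are
assumptions introduced for this example.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI for maximization under a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def trial_cost(episode_length, replications):
    """Hypothetical cost model: total environment steps used by one trial."""
    return episode_length * replications


def pick_next(candidates, best):
    """Return the candidate with the highest EI per unit cost."""
    def score(c):
        ei = expected_improvement(c["mean"], c["std"], best)
        return ei / trial_cost(c["episode_length"], c["replications"])
    return max(candidates, key=score)


# Hypothetical posterior summaries for two subgoal designs.
candidates = [
    {"name": "long", "mean": 0.60, "std": 0.2,
     "episode_length": 200, "replications": 5},
    {"name": "short", "mean": 0.55, "std": 0.2,
     "episode_length": 50, "replications": 2},
]
chosen = pick_next(candidates, best=0.5)
print(chosen["name"])
```

Under this assumed cost model, the cheaper trial is selected even though its
posterior mean is lower, because its improvement per environment step is higher.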

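This paper proposes a cost-aware Bayesian optimization method that aims to
overcome challenges such as sparse rewards, expensive interactions, and noise
through a class of dynamic subgoal-based exploration strategies, enabling policy
learning in environments drawn from an unknown distribution. In the experimental
evaluation, averaged across problem domains, the proposed algorithm outperforms
the meta-learning algorithm MAML by 19%, the hyperparameter tuning method
Hyperband by 23%, and the BO techniques EI and LCB by 24% and 22%, respectively.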