Model-based reinforcement learning promises to learn an optimal policy from
fewer interactions with the environment than model-free reinforcement
learning by learning an intermediate model of the environment and using it to
predict future interactions. When predicting a sequence of interactions, the
rollout length, which limits the prediction horizon, is a critical
hyperparameter because prediction accuracy diminishes in regions that
are further away from real experience. As a result, a longer rollout
length yields an overall worse policy in the long run. Thus, the
hyperparameter provides a trade-off between policy quality and sample efficiency. In this
work, we frame the problem of tuning the rollout length as a meta-level
sequential decision-making problem. The objective is to optimize the final
policy learned by model-based reinforcement learning under a fixed budget of
environment interactions by adapting the hyperparameter dynamically based on
feedback from the learning process, such as the accuracy of the model and the
remaining budget of interactions. We use model-free deep reinforcement learning to solve the
meta-level decision problem and demonstrate that our approach outperforms
common heuristic baselines on two well-known reinforcement learning
environments.

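This paper treats tuning the rollout length as a meta-level decision problem: it dynamically adapts the hyperparameter to optimize the final policy learned by model-based reinforcement learning under a fixed budget of environment interactions, solves the meta-level decision problem with deep reinforcement learning, and demonstrates the approach's advantages on two common reinforcement learning environments.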
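
To make the meta-level framing concrete, the following is a minimal sketch of how such a decision problem might be structured. It is not the authors' implementation: the names `MetaState`, `ACTIONS`, `apply_action`, and `meta_step`, the three-way action set, and the length bounds are all assumptions introduced here for illustration. The state carries the two feedback signals named above (model accuracy and remaining interaction budget) together with the current rollout length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaState:
    """Hypothetical meta-level observation (all fields are assumptions)."""
    model_error: float    # feedback signal, e.g. validation loss of the learned model
    budget_left: int      # environment interactions still available
    rollout_length: int   # hyperparameter currently in use


# Assumed action set: shorten, keep, or lengthen the rollout by one step.
ACTIONS = (-1, 0, 1)


def apply_action(state: MetaState, action: int,
                 min_len: int = 1, max_len: int = 20) -> int:
    """Return the new rollout length, clamped to [min_len, max_len]."""
    return max(min_len, min(max_len, state.rollout_length + action))


def meta_step(state: MetaState, action: int,
              interactions_per_step: int, new_model_error: float):
    """Advance the meta-level problem by one decision.

    new_model_error stands in for whatever the inner model-based learner
    reports after training with the chosen rollout length.
    """
    length = apply_action(state, action)
    budget = max(0, state.budget_left - interactions_per_step)
    next_state = MetaState(new_model_error, budget, length)
    done = budget == 0  # the episode ends when the budget is exhausted
    return next_state, done
```

In this sketch, a meta-level agent would pick one of `ACTIONS` at each decision point, and the reward (not shown) would reflect the quality of the final policy once `done` is true. How the reward and state features are actually defined is left open here, since the abstract does not specify them.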