The performance of reinforcement learning (RL) algorithms is sensitive to the choice of hyperparameters, with the learning rate being particularly influential. RL algorithms can fail to converge or require a large number of samples when the learning rate is poorly chosen. In this work, we show that model selection can mitigate the failure modes of RL that are due to suboptimal choices of learning rate. We present a framework for Learning Rate-Free Reinforcement Learning that employs model selection methods to select the optimal learning rate on the fly. This adaptive approach to learning rate tuning depends on neither the underlying RL algorithm nor the optimizer, and it uses only reward feedback to select the learning rate; hence, the framework can take any RL algorithm as input and produce a learning rate-free version of it. We conduct experiments for policy optimization methods and evaluate various model selection strategies within our framework. Our results indicate that data-driven model selection algorithms are better alternatives to standard bandit algorithms when the optimal choice of hyperparameter is time-dependent and non-stationary.
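To make the idea of selecting a learning rate from reward feedback alone more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `ToyLearner` (gradient ascent on a one-dimensional reward, standing in for an RL agent) and uses UCB1, a standard bandit algorithm, as the selector; the candidate learning rates and the round count are illustrative choices.

```python
import math


class ToyLearner:
    """Hypothetical stand-in for an RL agent (not from the paper)."""

    def __init__(self, lr):
        self.lr = lr
        self.x = 0.0  # single "policy parameter"; the optimum is x = 1

    def step(self):
        """One gradient-ascent update; returns a reward clipped to [0, 1]."""
        self.x += self.lr * (-2.0 * (self.x - 1.0))
        d = self.x - 1.0
        return max(0.0, min(1.0, 1.0 - d * d))


def select_learning_rate(learning_rates, rounds):
    """UCB1 over candidate learning rates, driven only by observed rewards."""
    learners = [ToyLearner(lr) for lr in learning_rates]
    counts = [0] * len(learners)
    totals = [0.0] * len(learners)
    for t in range(1, rounds + 1):
        if t <= len(learners):
            i = t - 1  # play every candidate once before using the index
        else:
            i = max(
                range(len(learners)),
                key=lambda k: totals[k] / counts[k]
                + math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        reward = learners[i].step()  # only the chosen learner is updated
        counts[i] += 1
        totals[i] += reward
    return counts


# Illustrative candidates: too small, reasonable, and divergent.
candidates = [0.001, 0.1, 1.05]
counts = select_learning_rate(candidates, rounds=300)
best_lr = candidates[counts.index(max(counts))]
print(counts, best_lr)
```

The selector never inspects the learner's internals or its optimizer, only the returned reward, which mirrors the algorithm-agnostic property described above. UCB1 treats each candidate's reward as stationary, whereas a learner's reward changes as it trains; this is the setting in which the abstract reports that data-driven model selection algorithms are the better alternative.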

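This work addresses the sensitivity of reinforcement learning algorithms to hyperparameter choices, in particular the failure to converge caused by a poorly set learning rate. It proposes a model selection framework for learning rate-free reinforcement learning that improves performance by selecting the best learning rate on the fly. Experiments show that data-driven model selection algorithms outperform standard bandit algorithms when the optimal hyperparameter choice is time-dependent and non-stationary.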

Learning Rate-Free Reinforcement Learning: A Case for Model Selection with Non-Stationary Objectives