We present a novel definition of the reinforcement learning state, actions
and reward function that allows a deep Q-network (DQN) to learn to control an
optimization hyperparameter. Using Q-learning with experience replay, we train
two DQNs to accept a state representation of an objective function as input and
output the expected discounted returns, or q-values, associated with the
actions of either adjusting the learning rate or leaving it unchanged. The two
DQNs learn a policy similar to a line search, but differ in the number of
allowed actions. The trained DQNs in combination with a gradient-based update
routine form the basis of the Q-gradient descent algorithms. To demonstrate the
viability of this framework, we show that the DQN's q-values associated with
the optimal action converge and that the Q-gradient descent algorithms outperform
gradient descent with an Armijo or nonmonotone line search. Unlike traditional
optimization methods, Q-gradient descent can incorporate any statistic of the
objective, and by varying the actions we gain insight into the types of learning
rate adjustment strategies that are successful for neural network optimization.
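
To make the control loop concrete, the following is a minimal sketch of a
gradient-based update whose learning rate is chosen greedily from q-values. It
is not the trained DQN described above: the function q_values is a hypothetical
stand-in that scores each action by a one-step decrease in the objective, and
the action set (halve, keep, double), the quadratic objective, and all names are
assumptions made only for illustration.

```python
# Hypothetical sketch of a Q-gradient-descent-style loop.
# Assumptions (not from the paper): the action set, the toy objective,
# and the hand-written q_values function standing in for a trained DQN.

ACTIONS = (0.5, 1.0, 2.0)  # assumed actions: halve, keep, or double the learning rate


def objective(x):
    """Toy objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


def gradient(x):
    """Derivative of the toy objective."""
    return 2.0 * (x - 3.0)


def q_values(x, lr):
    """Stand-in for a DQN: one score per action.

    Each score is the decrease in the objective after one gradient step
    taken with the rescaled learning rate (a one-step lookahead, not a
    learned expected discounted return).
    """
    g = gradient(x)
    return [objective(x) - objective(x - lr * scale * g) for scale in ACTIONS]


def q_gradient_descent(x0, lr0=0.01, steps=50):
    """Run gradient descent, letting the q-values pick the learning rate."""
    x, lr = x0, lr0
    for _ in range(steps):
        q = q_values(x, lr)
        best = max(range(len(ACTIONS)), key=q.__getitem__)  # greedy action
        lr *= ACTIONS[best]       # apply the chosen learning rate adjustment
        x -= lr * gradient(x)     # gradient-based update with the new rate
    return x, lr


if __name__ == "__main__":
    x_final, lr_final = q_gradient_descent(0.0)
    print(x_final, lr_final)
```

In this sketch the greedy choice behaves much like a crude line search. A
trained DQN would replace q_values with a network evaluated on a state
representation of the objective.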
