Bi-level optimization models can capture a wide range of complex
learning tasks of practical interest. Owing to their demonstrated efficiency in
solving bi-level programs, gradient-based methods have gained popularity in the
machine learning community. In this work, we propose a new gradient-based
solution scheme, namely, the Bi-level Value-Function-based Interior-point
Method (BVFIM). Following the main idea of the log-barrier interior-point
scheme, we add the regularized value function of the lower-level problem as a
penalty term to the upper-level objective. Solving the resulting sequence of
differentiable unconstrained approximation problems yields a sequential
programming scheme. The numerical advantage of our scheme is that applying
gradient methods to the approximation problems avoids computing any expensive
Hessian-vector or Jacobian-vector product. We prove convergence without
requiring any convexity assumption on either the upper-level or the
lower-level objective. Experiments demonstrate
the efficiency of the proposed BVFIM on non-convex bi-level problems.
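
The abstract gives no formulas, so the following is only a minimal sketch of how a value-function-based log-barrier scheme of this kind might look on a one-dimensional toy problem. The toy objectives, the exact form of the regularization and barrier terms, the parameter schedule, the backtracking rule, and every function name (`bvfim_toy`, `armijo_step`, and so on) are assumptions made for illustration and are not taken from the paper. The point it tries to show is that each update uses only first-order partial derivatives of the two objectives.

```python
import math

# Toy bi-level problem (an assumption for illustration, not from the paper):
#   upper level: F(x, y) = (x - 1)^2 + (y + 1)^2
#   lower level: f(x, y) = (y - x)^2, so the lower-level solution is y = x
# Its bi-level solution is x = y = 0.

def F(x, y): return (x - 1.0) ** 2 + (y + 1.0) ** 2
def f(x, y): return (y - x) ** 2
def dF_dx(x, y): return 2.0 * (x - 1.0)
def dF_dy(x, y): return 2.0 * (y + 1.0)
def df_dx(x, y): return -2.0 * (y - x)
def df_dy(x, y): return 2.0 * (y - x)


def armijo_step(obj, grad, y, t0=1.0, c1=0.5, max_halvings=60):
    """One gradient step with backtracking; infeasible points have obj = inf."""
    g, h0, t = grad(y), obj(y), t0
    for _ in range(max_halvings):
        y_new = y - t * g
        if obj(y_new) <= h0 - c1 * t * g * g:
            return y_new
        t *= 0.5
    return y  # no acceptable step found; keep the current point


def bvfim_toy(x=2.0, outer_iters=60, inner_iters=100, lr_x=0.1, lr_z=0.4):
    z = 0.0
    for k in range(outer_iters):
        mu = theta = sigma = 1.0 / 1.1 ** k  # assumed decreasing schedule

        # 1) Regularized lower-level problem: min_z f(x, z) + mu/2 z^2 + mu.
        for _ in range(inner_iters):
            z -= lr_z * (df_dy(x, z) + mu * z)
        value = f(x, z) + 0.5 * mu * z * z + mu  # regularized value function

        # 2) Log-barrier approximation of the upper level, solved in y.
        def obj(y):
            gap = value - f(x, y)  # must stay positive (interior point)
            if gap <= 0.0:
                return math.inf
            return F(x, y) + 0.5 * theta * y * y - sigma * math.log(gap)

        def grad(y):
            gap = value - f(x, y)
            return dF_dy(x, y) + theta * y + sigma * df_dy(x, y) / gap

        y = z  # strictly feasible start, since value - f(x, z) >= mu > 0
        for _ in range(inner_iters):
            y = armijo_step(obj, grad, y)

        # 3) First-order gradient estimate in x: only partial derivatives
        #    of F and f, no Hessian- or Jacobian-vector products.
        gap = value - f(x, y)
        g_x = dF_dx(x, y) - sigma * (df_dx(x, z) - df_dx(x, y)) / gap
        x -= lr_x * g_x
    return x, y
```

Calling `bvfim_toy()` returns the final pair `(x, y)`. In this toy setup the iterates should approach the bi-level solution x = y = 0, up to a small bias that shrinks as the assumed barrier and regularization parameters decrease.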

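This paper proposes BVFIM, a new value-function-based interior-point method for solving bi-level optimization models. Penalizing the regularized value function yields a sequence of continuously differentiable unconstrained approximation problems suited to complex learning tasks, and numerical experiments confirm the method's efficiency.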

A Value-Function-based Interior-point Method for Non-convex Bi-level Optimization

Recent advances in combining deep neural network architectures with
reinforcement learning techniques have shown promising results in
solving complex control problems with high-dimensional state and action spaces.
Inspired by these successes, we build two kinds of reinforcement learning
agents, a deep policy-gradient agent and a value-function-based agent, that
predict the best possible traffic signal for a traffic intersection. At
each time step, these adaptive traffic light control agents receive a snapshot
of the current state of a graphical traffic simulator and produce control
signals. The policy-gradient-based agent maps its observation directly to the
control signal, whereas the value-function-based agent first estimates values
for all legal control signals and then selects the control
action with the highest value. Our methods show promising results in a traffic
network simulated in the SUMO traffic simulator, without suffering from
instability issues during the training process.
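
The difference between the two agents' action selection can be sketched in a few lines. This is an illustrative sketch only: the abstract describes deep networks fed with simulator snapshots, whereas here a single linear layer over a small feature vector stands in for the network, and all names (`linear`, `policy_gradient_action`, `value_based_action`) and the sampling choice are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, observation):
    """Hypothetical stand-in for the deep network: one linear layer."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]


def policy_gradient_action(weights, observation, rng):
    """Map the observation directly to a distribution over signals, then sample."""
    probs = softmax(linear(weights, observation))
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


def value_based_action(weights, observation, legal_actions):
    """Estimate a value for every signal, then pick the best legal one."""
    q_values = linear(weights, observation)
    return max(legal_actions, key=lambda a: q_values[a])
```

With made-up weights, `value_based_action(weights, observation, legal_actions=[0, 1, 2])` returns whichever of the three legal signals has the largest estimated value, while `policy_gradient_action(weights, observation, random.Random(0))` returns a sampled signal together with the probabilities it was drawn from.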

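Building on recent advances in combining deep neural network architectures with reinforcement learning techniques for complex control problems with high-dimensional state and action spaces, this paper constructs two reinforcement learning agents, one policy-gradient-based and one value-function-based, that predict the best signal state for a traffic intersection. Experiments in the SUMO traffic simulator show that the methods do not suffer from instability during training.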