Classical global convergence results for first-order methods rely on uniform
smoothness and the \L{}ojasiewicz inequality. Motivated by properties of
objective functions that arise in machine learning, we propose a non-uniform
refinement of these notions, leading to \emph{Non-uniform Smoothness} (NS) and
\emph{Non-uniform \L{}ojasiewicz inequality} (N\L{}). The new definitions
inspire new geometry-aware first-order methods that are able to converge to
global optimality faster than the classical $\Omega(1/t^2)$ lower bounds. To
illustrate the power of these geometry-aware methods and their corresponding
non-uniform analysis, we consider two important problems in machine learning:
policy gradient optimization in reinforcement learning (PG), and generalized
linear model training in supervised learning (GLM). For PG, we find that
normalizing the gradient ascent method can accelerate convergence to
$O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we
show that geometry-aware normalized gradient descent can also achieve a linear
convergence rate, which significantly improves the best known results. We
additionally show that the proposed geometry-aware descent methods escape
landscape plateaus faster than standard gradient descent. Experimental results
are used to illustrate and complement the theoretical findings.
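
As a rough illustration of why dividing the gradient by a local, non-uniform smoothness estimate can help on flat landscapes, the sketch below compares plain gradient descent with a geometry-aware step on the toy objective $f(x) = x^4$. This is not the paper's algorithm or experimental setup: the objective, the use of $|f''(x)| = 12x^2$ as a stand-in for the non-uniform smoothness coefficient, the step sizes, and all function names are assumptions made only for this example.

```python
# Toy comparison on f(x) = x^4, which is very flat near its minimizer x = 0.
# All names and constants here are illustrative assumptions.

def f(x):
    return x ** 4


def grad(x):
    return 4.0 * x ** 3


def local_smoothness(x):
    # Assumed stand-in for a non-uniform smoothness coefficient beta(x):
    # the magnitude of the second derivative at the current point.
    return 12.0 * x ** 2


def gradient_descent(x0, lr, steps):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def geometry_aware_descent(x0, lr, steps):
    """Gradient descent whose step is rescaled by 1 / beta(x)."""
    x = x0
    for _ in range(steps):
        beta = local_smoothness(x)
        if beta == 0.0:  # already at the minimizer; avoid dividing by zero
            break
        x = x - lr * grad(x) / beta
    return x


if __name__ == "__main__":
    steps = 100
    # 1/12 is the inverse of the uniform smoothness constant of f on [-1, 1].
    x_gd = gradient_descent(1.0, lr=1.0 / 12.0, steps=steps)
    x_ga = geometry_aware_descent(1.0, lr=1.0, steps=steps)
    print("plain GD       f(x) =", f(x_gd))
    print("geometry-aware f(x) =", f(x_ga))
```

With these choices the geometry-aware update reduces to $x \leftarrow (1 - \eta/3)\,x$, so the iterate contracts by a constant factor per step, whereas the plain update $x \leftarrow x - x^3/3$ slows down as the gradient vanishes near the minimizer.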

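Using non-uniform smoothness and a non-uniform \L{}ojasiewicz inequality, the paper proposes new methods that reach global optima faster, and shows broad applicability and experimental gains in reinforcement learning and supervised learning.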

Leveraging Non-uniformity in First-order Non-convex Optimization

Although deep convolutional neural networks achieve state-of-the-art
performance across nearly all image classification tasks, their decisions are
difficult to interpret. One approach that offers some level of interpretability
by design is \textit{hard attention}, which uses only relevant portions of the
image. However, training hard attention models with only class label
supervision is challenging, and hard attention has proved difficult to scale to
complex datasets. Here, we propose a novel hard attention model, which we term
Saccader. Key to Saccader is a pretraining step that requires only class labels
and provides initial attention locations for policy gradient optimization. Our
best models narrow the gap to common ImageNet baselines, achieving $75\%$ top-1
and $91\%$ top-5 while attending to less than one-third of the image.

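Saccader, a hard attention model trained with only class labels and policy gradient optimization, classifies images accurately while attending to only part of each image, reaching 75% top-1 and 91% top-5 accuracy, close to common ImageNet baselines.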

Saccader: Improving Accuracy of Hard Attention Models for Vision

We develop a framework for convexifying a fairly general class of
optimization problems. Under additional assumptions, we analyze the
suboptimality of the solution to the convexified problem relative to the
original nonconvex problem and prove additive approximation guarantees. We then
develop algorithms based on stochastic gradient methods to solve the resulting
optimization problems and show bounds on convergence rates. We show a simple
application of this framework to supervised learning, where one can perform
integration explicitly and can use standard (non-stochastic) optimization
algorithms with better convergence guarantees. We then extend this framework to
a general class of discrete-time dynamical systems. In this context,
our convexification approach falls under the well-studied paradigm of
risk-sensitive Markov Decision Processes. We derive the first known model-based
and model-free policy gradient optimization algorithms with guaranteed
convergence to the optimal solution. Finally, we present numerical results
validating our formulation in different applications.

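The paper proposes a convexification framework, with algorithms based on stochastic gradient methods, for solving optimization problems in different domains, including supervised learning and dynamical systems, and derives model-based and model-free policy gradient optimization algorithms with guaranteed convergence.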