We resolve the min-max complexity of distributed stochastic convex
optimization (up to a log factor) in the intermittent communication setting,
where $M$ machines work in parallel over the course of $R$ rounds of
communication to optimize the objective, and during each round of
communication, each machine may sequentially compute $K$ stochastic gradient
estimates. We present a novel lower bound with a matching upper bound that
establishes an optimal algorithm.
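
The intermittent communication setting can be made concrete with a small simulation. The sketch below uses Local SGD only as a familiar method that fits the setting; it is not presented as the optimal algorithm referred to above. The scalar objective $F(x) = \frac{1}{2}(x-3)^2$, the Gaussian gradient noise, and the function name `local_sgd` are illustrative assumptions.

```python
import random

def local_sgd(M, R, K, step, noise_std=1.0, x0=0.0, seed=0):
    """Simulate M machines, R communication rounds, K local steps per round.

    Assumed toy objective: F(x) = 0.5 * (x - 3)^2, whose stochastic gradient
    is (x - 3) plus zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    x = x0
    grads_used = 0
    for _ in range(R):                      # one pass = one communication round
        local_iterates = []
        for _ in range(M):                  # machines are simulated one by one
            xm = x                          # each machine starts from the shared point
            for _ in range(K):              # K sequential stochastic gradients
                g = (xm - 3.0) + rng.gauss(0.0, noise_std)
                xm -= step * g
                grads_used += 1
            local_iterates.append(xm)
        x = sum(local_iterates) / M         # communicate: average the local iterates
    return x, grads_used

x_final, budget = local_sgd(M=4, R=10, K=5, step=0.1)
```

Here `budget` equals $M \cdot R \cdot K$, the total number of stochastic gradients computed across all machines.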

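This work studies the min-max complexity of distributed stochastic convex optimization in the intermittent communication setting, presenting a novel lower bound and a matching upper bound that together establish an optimal algorithm.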

Min-Max Complexity of Distributed Stochastic Convex Optimization and Intermittent Communication

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

Variational Bayesian Inference is a popular methodology for approximating
posterior distributions over Bayesian neural network weights. Recent work
developing this class of methods has explored ever richer parameterizations of
the approximate posterior in the hope of improving performance. In contrast,
here we share a curious experimental finding that suggests instead restricting
the variational distribution to a more compact parameterization. For a variety
of deep Bayesian neural networks trained using Gaussian mean-field variational
inference, we find that the posterior standard deviations consistently exhibit
strong low-rank structure after convergence. This means that by decomposing
these variational parameters into a low-rank factorization, we can make our
variational approximation more compact without decreasing the models'
performance. Furthermore, we find that such factorized parameterizations
improve the signal-to-noise ratio of stochastic gradient estimates of the
variational lower bound, resulting in faster convergence.
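
A small sketch may help show why a low-rank factorization makes the approximation more compact. It assumes the $m \times n$ matrix of posterior standard deviations is written as $U V^\top$ with strictly positive factors $U$ ($m \times k$) and $V$ ($n \times k$); the layer shape, the rank, and the helper names are hypothetical.

```python
def tied_std(U, V):
    """Build an m x n matrix of standard deviations as U @ V^T.

    U is m x k and V is n x k. If both have strictly positive entries,
    every resulting standard deviation is positive.
    """
    k = len(U[0])
    return [[sum(u[r] * v[r] for r in range(k)) for v in V] for u in U]

def std_param_counts(m, n, k):
    """Return (full, factorized) counts of standard-deviation parameters."""
    return m * n, k * (m + n)

# Hypothetical layer shape and rank, chosen only for illustration.
full, tied = std_param_counts(m=512, n=256, k=2)

U = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]]   # m = 3, k = 2
V = [[1.0, 0.5], [0.5, 1.0]]               # n = 2, k = 2
sigma = tied_std(U, V)                     # 3 x 2 matrix of positive stds
```

For the assumed 512 x 256 layer with rank 2, the standard deviations need $k(m+n) = 1536$ parameters instead of $mn = 131072$.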

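By applying a low-rank factorization to the posterior standard deviations of deep Bayesian neural networks trained with Gaussian mean-field variational inference, the variational approximation can be parameterized more compactly and the signal-to-noise ratio of its gradient estimates improved, which speeds up convergence.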

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean-Field Posteriors in Bayesian Neural Networks

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks

Online optimization has been a successful framework for solving large-scale
problems under computational constraints and partial information. Current
methods for online convex optimization require either a projection or exact
gradient computation at each step, both of which can be prohibitively expensive
for large-scale applications. At the same time, there is a growing trend of
non-convex optimization in the machine learning community and a need for online
methods. Continuous DR-submodular functions, which exhibit a natural
diminishing returns condition, have recently been proposed as a broad class of
non-convex functions which may be efficiently optimized. Although online
methods have been introduced, they suffer from similar problems. In this work,
we propose Meta-Frank-Wolfe, the first online projection-free algorithm that
uses stochastic gradient estimates. The algorithm relies on a careful sampling
of gradients in each round and achieves the optimal $O( \sqrt{T})$ adversarial
regret bounds for convex and continuous submodular optimization. We also
propose One-Shot Frank-Wolfe, a simpler algorithm which requires only a single
stochastic gradient estimate in each round and achieves an $O(T^{2/3})$
stochastic regret bound for convex and continuous submodular optimization. We
apply our methods to develop a novel "lifting" framework for online
discrete submodular maximization and show that they outperform current
state-of-the-art techniques in various experiments.
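
To illustrate what "projection-free" means, the sketch below shows one classical Frank-Wolfe update over the probability simplex, driven by a gradient estimate. It is not Meta-Frank-Wolfe or One-Shot Frank-Wolfe themselves; the feasible set, the step size, and the gradient values are assumptions chosen for illustration.

```python
def lmo_simplex(grad):
    """Linear minimization oracle over the probability simplex.

    Returns the vertex e_j minimizing <grad, s>, i.e. j = argmin_j grad[j].
    """
    j = min(range(len(grad)), key=lambda idx: grad[idx])
    return [1.0 if idx == j else 0.0 for idx in range(len(grad))]

def frank_wolfe_step(x, grad_estimate, gamma):
    """One Frank-Wolfe update: move toward the LMO vertex, with no projection."""
    s = lmo_simplex(grad_estimate)
    # A convex combination of two feasible points stays feasible.
    return [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]

x = [0.25, 0.25, 0.25, 0.25]
g = [0.3, -1.2, 0.7, 0.1]   # hypothetical stochastic gradient estimate
x_next = frank_wolfe_step(x, g, gamma=0.5)
```

Because each update is a convex combination of the current point and a vertex, the iterate remains in the feasible set, and the only per-step cost beyond the gradient estimate is a linear minimization.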

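The paper proposes Meta-Frank-Wolfe and a simpler variant, One-Shot Frank-Wolfe, two projection-free online algorithms for convex and continuous submodular optimization. Both rely on stochastic gradient estimates instead of exact gradients or projections.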

Projection-Free Online Optimization Based on Stochastic Gradients: From Convexity to Submodularity

Projection-Free Online Optimization with Stochastic Gradient: From Convexity to Submodularity

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH),
as well as its practical variant SARAH+, as a novel approach to finite-sum
minimization problems. Unlike vanilla SGD and other modern
stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple
recursive framework for updating stochastic gradient estimates; compared
to SAG/SAGA, SARAH does not require storage of past gradients. The linear
convergence rate of SARAH is proven under a strong convexity assumption. We also
prove a linear convergence rate (in the strongly convex case) for an inner loop
of SARAH, a property that SVRG does not possess. Numerical experiments
demonstrate the efficiency of our algorithm.
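
The recursive update can be sketched as follows for a scalar finite-sum problem. The update $v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}$ is the SARAH recursion; returning the last inner iterate, the least-squares components, and all parameter values are simplifying assumptions made for this illustration.

```python
import random

def sarah(grad_i, n, w0, step, outer_iters, inner_iters, seed=0):
    """Sketch of SARAH for min_w (1/n) * sum_i f_i(w), with scalar w.

    grad_i(i, w) returns the gradient of the i-th component f_i at w.
    Simplification: each outer loop continues from the last inner iterate.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(outer_iters):
        # Outer step: one full gradient pass initializes the estimate v.
        v = sum(grad_i(i, w) for i in range(n)) / n
        w_prev, w = w, w - step * v
        for _ in range(inner_iters):
            i = rng.randrange(n)
            # Recursive update: correct the previous estimate using a single
            # sampled component; no table of past gradients is stored.
            v = grad_i(i, w) - grad_i(i, w_prev) + v
            w_prev, w = w, w - step * v
    return w

# Assumed toy problem: f_i(w) = 0.5 * (a_i * w - b_i)^2, minimized at w = 2.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
w_hat = sarah(lambda i, w: a[i] * (a[i] * w - b[i]), n=4, w0=0.0,
              step=0.02, outer_iters=40, inner_iters=5)
```

Only the previous iterate and the running estimate `v` are kept between steps, which is the contrast with the gradient table that SAG/SAGA maintain.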

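This paper proposes SARAH, a stochastic recursive gradient algorithm, together with its practical variant SARAH+, for finite-sum minimization problems, and proves that the algorithm converges at a linear rate in the strongly convex case.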