Learning Rate Rewinding (LRR) has been established as a strong variant of
Iterative Magnitude Pruning (IMP) to find lottery tickets in deep
overparameterized neural networks. While both iterative pruning schemes couple
structure and parameter learning, understanding how LRR excels in both aspects
can bring us closer to the design of more flexible deep learning algorithms
that can optimize diverse sets of sparse architectures. To this end, we conduct
experiments that disentangle the effects of mask learning and parameter
optimization and examine how both benefit from overparameterization. The
ability of LRR to flip parameter signs early and stay robust to sign
perturbations seems to make it more effective not only in mask identification
but also in optimizing diverse sets of masks, including random ones. In support
of this hypothesis, we
prove in a simplified single hidden neuron setting that LRR succeeds in more
cases than IMP, as it can escape initially problematic sign configurations.
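
The difference between the two schemes can be illustrated with a minimal sketch. It assumes a toy quadratic objective, a `train` callback that restarts its learning rate schedule on every call, and IMP-style rewinding to the initial weights; `prune_smallest`, `iterative_pruning`, `toy_train`, and `TARGET` are hypothetical names, not code from the paper.

```python
def prune_smallest(weights, mask, prune_frac):
    """Mask out the smallest-magnitude fraction of the surviving weights."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(alive) * prune_frac)
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]:
        new_mask[i] = False
    return new_mask


def iterative_pruning(init, train, rounds, prune_frac, rewind_weights):
    """rewind_weights=True: IMP-style restart from the initial weights.
    rewind_weights=False: LRR-style, keep trained weights, restart only the schedule."""
    mask = [True] * len(init)
    weights = train(list(init), mask)
    for _ in range(rounds):
        mask = prune_smallest(weights, mask, prune_frac)
        source = init if rewind_weights else weights
        start = [w if keep else 0.0 for w, keep in zip(source, mask)]
        weights = train(start, mask)  # the schedule restarts inside `train`
    return weights, mask


# Toy problem (assumption): minimize sum((w - t)^2) over the unmasked weights.
TARGET = [1.0, -2.0, 0.5, -0.1, 3.0, 0.2, -1.5, 0.05]


def toy_train(weights, mask, steps=20):
    w = list(weights)
    for step in range(steps):
        lr = 0.4 * (1 - step / steps)  # linearly decaying learning rate
        w = [wi - lr * 2 * (wi - t) if keep else 0.0
             for wi, t, keep in zip(w, TARGET, mask)]
    return w


init = [0.3, 0.3, -0.3, 0.3, -0.3, 0.3, 0.3, -0.3]
imp_weights, imp_mask = iterative_pruning(init, toy_train, 2, 0.5, rewind_weights=True)
lrr_weights, lrr_mask = iterative_pruning(init, toy_train, 2, 0.5, rewind_weights=False)
```

On this convex toy problem both variants find the same mask, so the sketch only shows where the two procedures differ mechanically. With LRR, the sign flips learned in earlier rounds (indices 1 and 4 here) are carried into the next round instead of being undone by rewinding.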

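Understanding how Learning Rate Rewinding excels at both structure and parameter learning brings us closer to designing more flexible deep learning algorithms that can optimize diverse sets of sparse architectures.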

Masks, Signs, And Learning Rate Rewinding

Given the ever-increasing size of modern neural networks, the significance of
sparse architectures has surged due to their accelerated inference speeds and
minimal memory demands. When it comes to global pruning techniques, Iterative
Magnitude Pruning (IMP) still stands as a state-of-the-art algorithm despite
its simple nature, particularly in extremely sparse regimes. In light of the
recent finding that two successive matching IMP solutions are linearly
connected without a loss barrier, we propose Sparse Weight Averaging with
Multiple Particles (SWAMP), a straightforward modification of IMP that achieves
performance comparable to an ensemble of two IMP solutions. In every
iteration, we concurrently train multiple sparse models, referred to as
particles, using different batch orders but the same matching ticket, and then
average the weights of these models to produce a single mask. We demonstrate
that our method consistently outperforms existing baselines across different
sparsities through extensive experiments on various datasets and neural network
architectures.
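
One iteration of the procedure described above might look like the following sketch. It assumes a toy SGD update in which each "batch" is a target vector, and it leaves out rewinding to the matching ticket between iterations; `BATCHES`, `train_particle`, `average_weights`, and `swamp_round` are hypothetical names used only for illustration.

```python
import random

# Toy data (assumption): each batch pulls the weights toward a noisy target.
BATCHES = [
    [1.0, -0.1, 2.0, 0.2],
    [1.2, 0.1, 1.8, -0.2],
    [0.8, 0.0, 2.2, 0.0],
]


def train_particle(ticket, mask, seed, lr=0.5):
    """Train one particle from the shared ticket with its own batch order."""
    order = list(BATCHES)
    random.Random(seed).shuffle(order)  # the seed only changes the batch order
    w = list(ticket)
    for batch in order:
        w = [wi + lr * (t - wi) if keep else 0.0
             for wi, t, keep in zip(w, batch, mask)]
    return w


def average_weights(particles):
    """Element-wise mean of the particles' weights."""
    return [sum(ws) / len(particles) for ws in zip(*particles)]


def prune_smallest(weights, mask, prune_frac):
    """Mask out the smallest-magnitude fraction of the surviving weights."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(alive) * prune_frac)
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]:
        new_mask[i] = False
    return new_mask


def swamp_round(ticket, mask, n_particles, prune_frac):
    """Train particles, average them, then prune the average to get one mask."""
    particles = [train_particle(ticket, mask, seed) for seed in range(n_particles)]
    averaged = average_weights(particles)
    return averaged, prune_smallest(averaged, mask, prune_frac)


ticket = [0.1, 0.1, 0.1, 0.1]
averaged, new_mask = swamp_round(ticket, [True] * 4, n_particles=4, prune_frac=0.5)
```

Averaging happens before pruning, so the mask is derived from a single averaged model rather than from any individual particle.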

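This paper proposes Sparse Weight Averaging with Multiple Particles (SWAMP), a modification of the Iterative Magnitude Pruning (IMP) algorithm. It trains multiple sparse models concurrently and averages their weights to obtain better generalization, outperforming existing baselines across different sparsities.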

SWAMP: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning

Neural architecture search (NAS) has demonstrated promising results in
identifying efficient Transformer architectures that outperform manually
designed ones for natural language tasks like neural machine translation (NMT).
Existing NAS methods operate on a space of dense architectures, where all of
the sub-architecture weights are activated for every input. Motivated by the
recent advances in sparsely activated models like the Mixture-of-Experts (MoE)
model, we introduce sparse architectures with conditional computation into the
NAS search space. Given this expressive search space which subsumes prior
densely activated architectures, we develop a new framework AutoMoE to search
for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse
models obtain (i) 3x FLOPs reduction over manually designed dense Transformers
and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense
sub-Transformers with parity in BLEU score on benchmark datasets for NMT.
AutoMoE consists of three training phases: (a) Heterogeneous search space
design with dense and sparsely activated Transformer modules (e.g., how many
experts? where to place them? what should be their sizes?); (b) SuperNet
training that jointly trains several subnetworks sampled from the large search
space by weight-sharing; (c) Evolutionary search for the architecture with the
optimal trade-off between task performance and computational constraints such
as FLOPs and latency. AutoMoE code, data and trained models are available at
this https URL.
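
Phase (c) can be sketched as a mutation-only evolutionary search under a FLOPs budget. Everything concrete here is an assumption made for illustration: the search space (`EXPERT_CHOICES`, `FFN_CHOICES`), the cost model in `flops_per_token`, and `toy_score`, which stands in for evaluating a subnetwork with weights inherited from the SuperNet. None of it is taken from the AutoMoE code.

```python
import math
import random

# Hypothetical search space: each layer picks (number of experts, expert FFN width).
EXPERT_CHOICES = [1, 2, 4]  # 1 expert means a dense layer
FFN_CHOICES = [512, 1024, 2048]
NUM_LAYERS = 4
D_MODEL = 256


def random_arch(rng):
    return [(rng.choice(EXPERT_CHOICES), rng.choice(FFN_CHOICES))
            for _ in range(NUM_LAYERS)]


def flops_per_token(arch):
    """Assumed cost model: two d_model x ffn matmuls per layer, top-1 routing.
    Each token visits one expert, so the expert count does not add FLOPs."""
    return sum(2 * 2 * D_MODEL * ffn for _experts, ffn in arch)


def toy_score(arch):
    """Stand-in for task performance; grows with total expert capacity."""
    return sum(math.log2(experts * ffn) for experts, ffn in arch)


def mutate(arch, rng):
    """Resample the configuration of one randomly chosen layer."""
    child = list(arch)
    child[rng.randrange(len(child))] = (rng.choice(EXPERT_CHOICES),
                                        rng.choice(FFN_CHOICES))
    return child


def evolutionary_search(budget, rng, population_size=16, generations=10, n_parents=4):
    population = []
    while len(population) < population_size:
        arch = random_arch(rng)
        if flops_per_token(arch) <= budget:  # keep only feasible architectures
            population.append(arch)
    for _ in range(generations):
        parents = sorted(population, key=toy_score, reverse=True)[:n_parents]
        children = []
        while len(children) < population_size - n_parents:
            child = mutate(rng.choice(parents), rng)
            if flops_per_token(child) <= budget:
                children.append(child)
        population = parents + children
    return max(population, key=toy_score)


BUDGET = 4_500_000
best = evolutionary_search(BUDGET, random.Random(0))
```

Under this cost model, adding experts raises the toy score without raising per-token FLOPs, which is the property that makes sparsely activated candidates attractive under a compute budget.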

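AutoMoE searches for efficient sparsely activated Transformer models built from Mixture-of-Experts modules. The resulting models achieve a 3x FLOPs reduction over manually designed dense Transformers and a 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers, with parity in BLEU score on NMT benchmark datasets.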

AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers

Multi-task learning with an unbalanced data distribution skews model learning
towards high resource tasks, especially when model capacity is fixed and fully
shared across all tasks. Sparse scaling architectures, such as BASELayers,
provide flexible mechanisms for different tasks to have a variable number of
parameters, which can be useful to counterbalance skewed data distributions. We
find that sparse architectures for multilingual machine translation can
perform poorly out of the box, and propose two straightforward techniques to
mitigate this: a temperature heating mechanism and dense pre-training.
Overall, these methods improve performance on two multilingual translation
benchmarks compared to standard BASELayers and Dense scaling baselines and, in
combination, more than double model convergence speed.
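
The abstract does not define the temperature heating mechanism, so the sketch below rests on two assumptions. The first is that "temperature" refers to the common temperature-based task sampling rule, p_i proportional to n_i^(1/T). The second is that "heating" means raising T over the course of training. The linear schedule, its endpoints, and the function names are hypothetical.

```python
def sampling_probs(sizes, temperature):
    """Temperature-based sampling: p_i is proportional to n_i ** (1 / T).
    T = 1 follows the data distribution; larger T moves toward uniform."""
    scaled = [n ** (1.0 / temperature) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def heated_temperature(step, total_steps, t_start=1.0, t_end=5.0):
    """Assumed linear schedule that raises T from t_start to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


# One high-resource and one low-resource task (made-up sizes).
sizes = [900, 100]
early = sampling_probs(sizes, heated_temperature(0, 1000))
late = sampling_probs(sizes, heated_temperature(1000, 1000))
```

With these made-up sizes, the low-resource task is sampled 10% of the time at the start and more often as the temperature rises.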

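This paper studies sparse scaling architectures such as BASELayers as a way to counter the bias toward high-resource tasks in multi-task learning, and proposes two techniques, a temperature heating mechanism and dense pre-training, to improve multilingual machine translation. On two multilingual translation benchmarks, these methods outperform standard BASELayers and dense scaling baselines and, in combination, more than double model convergence speed.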

Tricks for Training Sparse Translation Models

Overparameterized Neural Networks (NN) display state-of-the-art performance.
However, there is a growing need for smaller, energy-efficient neural networks
to run machine learning applications on devices with limited
computational resources. A popular approach consists of using pruning
techniques. While these techniques have traditionally focused on pruning
pre-trained NN (LeCun et al., 1990; Hassibi et al., 1993), recent work by Lee et
al. (2018) has shown promising results when pruning at initialization. However,
for Deep NNs, such procedures remain unsatisfactory as the resulting pruned
networks can be difficult to train and, for instance, they do not prevent one
layer from being fully pruned. In this paper, we provide a comprehensive
theoretical analysis of magnitude- and gradient-based pruning at initialization
and training of sparse architectures. This allows us to propose novel
principled approaches which we validate experimentally on a variety of NN
architectures.
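
The failure mode mentioned above, a layer being fully pruned, can be reproduced with a small sketch of global pruning at initialization. The weights and gradients are made-up numbers, and the gradient-based criterion is assumed to be a sensitivity-style score |w * g|; the function names are hypothetical and this is not the paper's proposed method.

```python
def magnitude_scores(weights):
    """Magnitude criterion: score each weight by |w|."""
    return [[abs(w) for w in layer] for layer in weights]


def sensitivity_scores(weights, grads):
    """Gradient-based criterion (assumed form): score each weight by |w * g|."""
    return [[abs(w * g) for w, g in zip(wl, gl)] for wl, gl in zip(weights, grads)]


def global_prune(scores, sparsity):
    """Prune the lowest-scoring fraction across all layers at once.
    Scores tied with the threshold are pruned as well."""
    flat = sorted(s for layer in scores for s in layer)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [[True] * len(layer) for layer in scores]
    threshold = flat[n_prune - 1]
    return [[s > threshold for s in layer] for layer in scores]


def fully_pruned_layers(masks):
    """Indices of layers in which no weight survives."""
    return [i for i, layer in enumerate(masks) if not any(layer)]


# Two layers with very different initialization scales (made-up values).
weights = [[0.9, -1.2, 0.7, -0.5], [0.01, -0.02, 0.015, -0.005]]
# Hypothetical gradients at initialization, larger where the weights are small.
grads = [[0.01, 0.01, 0.01, 0.01], [1.0, 1.0, 1.0, 1.0]]

magnitude_masks = global_prune(magnitude_scores(weights), 0.5)
sensitivity_masks = global_prune(sensitivity_scores(weights, grads), 0.5)
```

With these numbers, global magnitude pruning at 50% sparsity removes every weight in the second layer, while the sensitivity scores happen to spare it. That outcome depends entirely on the chosen values and does not show that gradient-based scores avoid the problem in general.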

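A comprehensive theoretical analysis of pruning methods for deep neural networks, with experimental validation on a variety of network architectures.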