Traditional end-to-end (E2E) training of deep networks requires storing
intermediate activations for back-propagation, which results in a large memory
footprint on GPUs and restricts model parallelization. As an alternative,
greedy local learning partitions the network into gradient-isolated modules and
trains each module with supervision from a local preliminary loss, enabling
asynchronous, parallel training that substantially reduces memory cost.
However, experiments show that as the network is split into more
gradient-isolated modules, the performance of the local learning scheme
degrades substantially, which severely limits its scalability. To address this
issue, we analyze greedy local learning theoretically from the standpoint of
information theory and propose ContSup, a scheme that supplies context between
isolated modules to compensate for information loss. Experiments on benchmark
datasets (i.e., CIFAR, SVHN, STL-10) achieve state-of-the-art results and
indicate that the proposed method significantly improves the performance of
greedy local learning with minimal memory and computational overhead, allowing
the number of isolated modules to be increased. Our code is available at this
https URL

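From the standpoint of information theory, we propose ContSup, a scheme for greedy local learning that supplies context between isolated modules to compensate for information loss. Experiments on benchmark datasets (i.e., CIFAR, SVHN, STL-10) show that the proposed method significantly improves the performance of greedy local learning with only minimal memory and computational overhead, allowing the number of isolated modules to be increased.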
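The idea of gradient-isolated modules with an extra context input can be illustrated with a small NumPy sketch. This is not the authors' implementation: the `LocalModule` class, the tanh layers, the linear local heads, and the choice to concatenate the raw input `x` as "context" are all assumptions made for illustration. Each module updates only its own weights from its own local loss, and its input is treated as a constant, so no gradient reaches earlier modules. The modules are visited in a plain loop here, whereas the abstract describes asynchronous, parallel training.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, y):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(p[np.arange(n), y]).mean()
    grad = p.copy()
    grad[np.arange(n), y] -= 1.0
    return loss, grad / n

class LocalModule:
    """Hypothetical gradient-isolated module: one tanh layer plus its own linear head."""

    def __init__(self, d_in, d_hidden, n_classes):
        self.W = rng.normal(0.0, 0.5, (d_in, d_hidden))
        self.A = rng.normal(0.0, 0.5, (d_hidden, n_classes))

    def local_step(self, x, y, lr=0.1):
        # x is treated as a constant: nothing is propagated back to earlier modules.
        h = np.tanh(x @ self.W)
        loss, g_logits = softmax_xent(h @ self.A, y)
        g_A = h.T @ g_logits
        g_h = g_logits @ self.A.T
        g_W = x.T @ (g_h * (1.0 - h ** 2))
        self.A -= lr * g_A
        self.W -= lr * g_W
        return loss, h

def build_modules(d_in, d_hidden, n_classes, n_modules, context):
    """With context enabled, later modules also receive the raw input (an assumption)."""
    mods = [LocalModule(d_in, d_hidden, n_classes)]
    extra = d_in if context else 0
    for _ in range(n_modules - 1):
        mods.append(LocalModule(d_hidden + extra, d_hidden, n_classes))
    return mods

def train(x, y, modules, steps=300, context=True):
    """Returns a list with one entry per step: the local loss of every module."""
    history = []
    for _ in range(steps):
        h, losses = x, []
        for k, m in enumerate(modules):
            inp = np.concatenate([h, x], axis=1) if (context and k > 0) else h
            loss, h = m.local_step(inp, y)
            losses.append(loss)
        history.append(losses)
    return history

X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(int)
modules = build_modules(4, 8, 2, n_modules=3, context=True)
history = train(X, y, modules)
print("per-module loss, first step:", np.round(history[0], 3))
print("per-module loss, last step: ", np.round(history[-1], 3))
```

Because each `local_step` needs only the activations of a single module, the memory required for a backward pass is bounded by the largest module rather than by the whole network. Calling `build_modules` and `train` with `context=False` gives the plain greedy local baseline for comparison on this toy task.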

Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply

State space models (SSMs) have shown impressive results on tasks that require
modeling long-range dependencies and efficiently scale to long sequences owing
to their subquadratic runtime complexity. Originally designed for continuous
signals, SSMs have shown superior performance on a plethora of tasks in vision
and audio; however, SSMs still lag behind Transformers on language modeling
tasks. In this work, we propose a hybrid layer named Block-State Transformer
(BST), which internally combines an SSM sublayer for long-range
contextualization with a Block Transformer sublayer for short-term
representation of sequences. We study three different, fully parallelizable
variants that integrate SSMs and block-wise attention. We show that our model
outperforms similar Transformer-based architectures on language modeling
perplexity and generalizes to longer sequences. In addition, the Block-State
Transformer demonstrates a more than tenfold increase in speed at the layer
level compared to the Block-Recurrent Transformer when model parallelization
is employed.

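This paper introduces a hybrid layer named Block-State Transformer (BST), which internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences, and studies three fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, with model parallelization, the Block-State Transformer is more than ten times faster at the layer level than the Block-Recurrent Transformer.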
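The combination of an SSM sublayer with block-wise attention can be sketched roughly as follows. This is a simplified illustration, not one of the three BST variants studied in the paper. It rests on several assumptions: a diagonal linear SSM evaluated with a sequential scan, a single attention head, and a single context vector per block taken from the SSM output at the last position of the previous block. All function and parameter names are hypothetical.

```python
import numpy as np

def ssm_scan(u, a, B, C):
    """Diagonal linear SSM: s_t = a * s_{t-1} + B u_t, y_t = C s_t."""
    T = u.shape[0]
    s = np.zeros(a.shape[0])
    out = np.empty((T, C.shape[0]))
    for t in range(T):
        s = a * s + B @ u[t]
        out[t] = C @ s
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block_state_layer(x, a, B, C, Wq, Wk, Wv, block_len):
    """Block-wise causal attention whose keys/values include one SSM context vector."""
    T, d = x.shape
    assert T % block_len == 0, "sequence length must be a multiple of block_len"
    ctx = ssm_scan(x, a, B, C)  # (T, d) long-range context
    out = np.empty_like(x)
    causal = np.tril(np.ones((block_len, block_len), dtype=bool))
    # Every query may see the context slot, plus earlier tokens in its own block.
    mask = np.concatenate([np.ones((block_len, 1), dtype=bool), causal], axis=1)
    for start in range(0, T, block_len):  # blocks are independent given ctx
        blk = x[start:start + block_len]
        c = ctx[start - 1:start] if start > 0 else np.zeros((1, d))
        kv_in = np.concatenate([c, blk], axis=0)  # (1 + block_len, d)
        q, k, v = blk @ Wq, kv_in @ Wk, kv_in @ Wv
        scores = np.where(mask, q @ k.T / np.sqrt(d), -np.inf)
        out[start:start + block_len] = softmax(scores) @ v
    return out

rng = np.random.default_rng(0)
T, d, n_state, block_len = 12, 4, 6, 4
params = dict(
    a=np.full(n_state, 0.9),
    B=rng.normal(size=(n_state, d)),
    C=rng.normal(size=(d, n_state)),
    Wq=rng.normal(size=(d, d)),
    Wk=rng.normal(size=(d, d)),
    Wv=rng.normal(size=(d, d)),
)
x = rng.normal(size=(T, d))
y = block_state_layer(x, block_len=block_len, **params)
print(y.shape)
```

In this sketch, attention alone never looks outside its own block, so any influence of the first token on the last block has to travel through the SSM context. Once `ctx` is available, the per-block computations do not depend on each other, which is the property that makes this kind of layer easy to parallelize.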

Block-State Transformer

Modern neural networks require long training times to reach good performance
on massive datasets. One common approach to speeding up training is model
parallelization, where large neural networks are split across multiple devices.
However, different device placements of the same neural network lead to
different training times. Most existing device placement solutions treat the
problem as sequential decision-making, traversing the neural network graph and
assigning its neurons to different devices. This work studies the impact of
graph traversal order on device placement. In particular, we empirically study
how different graph traversal orders lead to different device placements,
which in turn affect training execution time. Our experimental results show
that the best graph traversal order depends on the type of neural network and
the features of its computation graph. We also provide recommendations on
choosing the graph traversal order in device placement for various neural
network families to reduce training time in model parallelization.

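This work studies the impact of neural network graph traversal order on device placement, in particular how to choose the best traversal order in model parallelization to reduce training time for different neural network families.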
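To see why traversal order matters for a sequential placer, consider the toy sketch below. It is not the placement method evaluated in the paper: the five-node graph, the node costs, and the greedy rule in `greedy_place` (pick the device with the lowest current load plus a penalty for each already-placed neighbor on another device) are all assumptions chosen to keep the example small. The same greedy rule is run on two visiting orders of the same graph.

```python
from collections import deque

def kahn_bfs_order(graph):
    """Breadth-first topological order (Kahn's algorithm)."""
    indeg = {n: 0 for n in graph}
    for n in graph:
        for m in graph[n]:
            indeg[m] += 1
    queue = deque(n for n in graph if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

def dfs_preorder(graph):
    """Depth-first preorder starting from the nodes without predecessors."""
    has_pred = {m for n in graph for m in graph[n]}
    seen, order = set(), []

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        order.append(n)
        for m in graph[n]:
            visit(m)

    for root in graph:
        if root not in has_pred:
            visit(root)
    return order

def greedy_place(order, graph, cost, n_devices=2, comm=1):
    """Hypothetical sequential placer: decisions depend on what was placed earlier."""
    neighbors = {n: set(graph[n]) for n in graph}
    for n in graph:
        for m in graph[n]:
            neighbors[m].add(n)
    load = [0] * n_devices
    placement = {}
    for n in order:
        def score(dev):
            cut = sum(1 for m in neighbors[n]
                      if m in placement and placement[m] != dev)
            return load[dev] + comm * cut
        dev = min(range(n_devices), key=score)  # ties go to the lowest device index
        placement[n] = dev
        load[dev] += cost[n]
    return placement

# Toy computation graph (adjacency lists of successors) and per-node compute costs.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
cost = {"a": 1, "b": 4, "c": 4, "d": 2, "e": 1}

for name, order_fn in [("BFS", kahn_bfs_order), ("DFS", dfs_preorder)]:
    order = order_fn(graph)
    print(name, order, greedy_place(order, graph, cost))
```

Because each decision only sees the nodes placed so far, visiting `d` before or after `c` changes the information available to the placer and, in this toy case, the final assignment of `e`. A learned placement policy is far more elaborate than this greedy rule, but it is exposed to the same order dependence.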