The increasing complexity of modern deep neural network models and the
expanding sizes of datasets necessitate the development of optimized and
scalable training methods. In this white paper, we address the challenge of
efficiently training neural network models on sequences of varying sizes. We
propose a novel training scheme that enables efficient distributed
data-parallel training on sequences of different sizes with minimal overhead.
With this scheme, we reduce the amount of padding by more than 100$\times$
without deleting a single frame, which improves both training time and recall
in our experiments.
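
The abstract does not describe how the padding is reduced, so the following is only a generic sketch of the underlying idea, not the BLoad scheme itself. It assumes that whole sequences may be concatenated into fixed-size blocks, and it uses a first-fit-decreasing heuristic chosen purely for illustration. The function names and the `block_size` parameter are hypothetical. The sketch compares the padding needed when every sequence is padded to the longest one with the padding left after packing, while keeping every frame.

```python
from typing import List


def padding_when_padded_to_max(lengths: List[int]) -> int:
    """Padding frames needed if every sequence is padded to the longest one."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


def pack_first_fit_decreasing(lengths: List[int], block_size: int) -> List[List[int]]:
    """Pack whole sequences into blocks holding at most block_size frames.

    Illustrative heuristic only (an assumption, not the paper's method).
    Sequences are never split or truncated, so no frame is dropped.
    """
    if max(lengths) > block_size:
        raise ValueError("block_size must be at least the longest sequence")
    blocks: List[List[int]] = []  # sequence lengths placed in each block
    free: List[int] = []          # remaining capacity of each block
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:
                blocks[i].append(n)
                free[i] -= n
                break
        else:
            # No existing block has room, so open a new one.
            blocks.append([n])
            free.append(block_size - n)
    return blocks


def padding_when_packed(blocks: List[List[int]], block_size: int) -> int:
    """Padding frames needed to fill every packed block up to block_size."""
    return sum(block_size - sum(block) for block in blocks)


lengths = [10, 3, 7, 2, 8, 5, 5]
blocks = pack_first_fit_decreasing(lengths, block_size=10)
print(padding_when_padded_to_max(lengths))        # 30
print(padding_when_packed(blocks, block_size=10))  # 0
```

In a real training pipeline, a packed block would also need per-sequence boundaries so that recurrent state or attention does not leak across sequences; that bookkeeping is omitted here.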

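The new training scheme enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead and improves overall performance in the experiments.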

BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling

Distributed data-parallel (DDP) training improves overall application
throughput because multiple devices each train on a subset of the data and
aggregate their updates to produce a globally shared model. The periodic
synchronization at each iteration incurs considerable overhead, exacerbated by
the increasing size and complexity of state-of-the-art neural networks.
Although many gradient compression techniques have been proposed to reduce
communication cost, the ideal compression factor that leads to maximum speedup
or minimum data exchange remains an open problem, since it varies with the
quality of compression, model size and structure, hardware, network topology,
and bandwidth. We propose GraVAC, a framework that dynamically adjusts the
compression factor throughout training by evaluating model progress and
assessing the gradient information loss associated with compression. GraVAC
works in an online, black-box manner without any prior assumptions about a
model or its hyperparameters, while achieving the same or better accuracy than
dense SGD (i.e., no compression) in the same number of iterations/epochs.
Compared to using a static compression factor, GraVAC reduces end-to-end
training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x,
respectively. Compared to other adaptive schemes, our framework provides 1.94x
to 5.63x overall speedup.
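
To make the notions of compression factor and gradient information loss concrete, here is a minimal sketch. It is not GraVAC's actual algorithm, which the abstract does not specify. It assumes top-k sparsification as the compressor and the retained fraction of the squared L2 norm as the loss measure. The selection policy in `pick_compression_factor` and the `min_energy` threshold are hypothetical, introduced only for illustration.

```python
import heapq
from typing import List, Sequence


def topk_sparsify(grad: Sequence[float], compression_factor: float) -> List[float]:
    """Keep the len(grad) / compression_factor largest-magnitude entries."""
    k = max(1, int(len(grad) / compression_factor))
    keep = set(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def retained_energy(grad: Sequence[float], compressed: Sequence[float]) -> float:
    """Fraction of the squared L2 norm that survives compression (1.0 = lossless)."""
    total = sum(g * g for g in grad)
    if total == 0:
        return 1.0
    return sum(c * c for c in compressed) / total


def pick_compression_factor(grad: Sequence[float],
                            candidates: Sequence[float],
                            min_energy: float) -> float:
    """Hypothetical policy: the largest candidate factor whose compressed
    gradient still retains at least min_energy of the original energy."""
    for cf in sorted(candidates, reverse=True):
        if retained_energy(grad, topk_sparsify(grad, cf)) >= min_energy:
            return cf
    return min(candidates)  # fall back to the mildest compression


grad = [4.0, -3.0, 0.1, 0.2, 0.0, -0.1, 0.05, 0.0]
print(pick_compression_factor(grad, candidates=[2, 4, 8], min_energy=0.9))  # 4
```

Calling such a function once per iteration, or once per window of iterations, would let the factor follow the gradient distribution as training progresses. An online scheme would also need to weigh the communication time saved, which this sketch ignores.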

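This paper presents GraVAC, a framework that dynamically adjusts the compression factor to reduce communication overhead and speed up distributed data-parallel training. GraVAC adapts compression according to model progress and gradient information loss. Compared to a static compression factor, it reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x, respectively; compared to other adaptive schemes, the overall speedup is 1.94x to 5.63x.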

GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Compressed communication, in the form of sparsification or quantization of
stochastic gradients, is employed to reduce communication costs in distributed
data-parallel training of deep neural networks. However, there exists a
discrepancy between theory and practice: while theoretical analysis of most
existing compression methods assumes compression is applied to the gradients of
the entire model, many practical implementations operate individually on the
gradients of each layer of the model. In this paper, we prove that layer-wise
compression is, in theory, better, because the convergence rate is upper
bounded by that of entire-model compression for a wide range of biased and
unbiased compression methods. However, despite the theoretical bound, our
experimental study of six well-known methods shows that convergence, in
practice, may or may not be better, depending on the actual trained model and
compression ratio. Our findings suggest that it would be advantageous for deep
learning frameworks to include support for both layer-wise and entire-model
compression.
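
The difference between the two granularities can be sketched as follows. The abstract does not name the six methods studied, so top-k sparsification is used here only as an assumed, representative compressor. The layer names and gradient values are made up for illustration.

```python
import heapq
from typing import Dict, List


def topk(values: List[float], k: int) -> List[float]:
    """Zero all but the k largest-magnitude entries."""
    keep = set(heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i])))
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def compress_layerwise(grads: Dict[str, List[float]], ratio: float) -> Dict[str, List[float]]:
    """Apply top-k to each layer separately, keeping `ratio` of that layer's entries."""
    return {name: topk(g, max(1, round(ratio * len(g)))) for name, g in grads.items()}


def compress_entire_model(grads: Dict[str, List[float]], ratio: float) -> Dict[str, List[float]]:
    """Apply top-k once to the concatenated gradient, then split it back per layer."""
    flat = [v for g in grads.values() for v in g]
    sparse = topk(flat, max(1, round(ratio * len(flat))))
    out: Dict[str, List[float]] = {}
    start = 0
    for name, g in grads.items():
        out[name] = sparse[start:start + len(g)]
        start += len(g)
    return out


# Made-up gradients: one layer has much larger magnitudes than the other.
grads = {"conv": [5.0, -4.0, 3.0, 2.0], "fc": [0.3, -0.2, 0.1, 0.05]}
print(compress_layerwise(grads, 0.5))
# {'conv': [5.0, -4.0, 0.0, 0.0], 'fc': [0.3, -0.2, 0.0, 0.0]}
print(compress_entire_model(grads, 0.5))
# {'conv': [5.0, -4.0, 3.0, 2.0], 'fc': [0.0, 0.0, 0.0, 0.0]}
```

At the same overall ratio, the entire-model variant keeps the globally largest entries and may leave a small-magnitude layer with no update at all, while the layer-wise variant guarantees every layer some entries. Which behavior converges faster is, as the abstract notes, dependent on the model and the compression ratio.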

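Through theoretical analysis and experiments, this paper shows that in distributed data-parallel training of deep neural networks, layer-wise compression is better in theory than entire-model compression, but that in practice convergence may or may not be better, depending on the trained model and the compression ratio. The paper therefore recommends that deep learning frameworks support both layer-wise and entire-model compression.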