The ubiquity of vision transformers (ViTs) for various edge applications,
including personalized learning, has created the demand for on-device
fine-tuning. However, training with the limited memory and computation power of
edge devices remains a significant challenge. In particular, the memory
required for training is much higher than that needed for inference, primarily
due to the need to store activations across all layers in order to compute the
gradients needed for weight updates. Previous works have explored reducing this
memory requirement via frozen-weight training as well as storing the activations
in a compressed format. However, these methods are deemed inefficient due to
their inability to provide training or inference speedup. In this paper, we
first investigate the limitations of existing on-device training methods aimed
at reducing memory and compute requirements. We then present block selective
reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a
pre-trained model and selectively drop tokens based on self-attention scores of
the frozen layers. To show the efficacy of BSR, we present extensive
evaluations on ViT-B and DeiT-S with five different datasets. Compared to the
existing alternatives, our approach simultaneously reduces training memory by
up to 1.4x and compute cost by up to 2x while maintaining similar accuracy. We
also showcase results for Mixture-of-Expert (MoE) models, demonstrating the
effectiveness of our approach in multitask learning scenarios.
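
The token-dropping step described above can be illustrated with a minimal sketch. The abstract does not specify the exact selection rule, so everything here is an assumption made for illustration: the function name `select_tokens_by_attention`, the use of head-averaged attention from the class token as the score, and the `keep_ratio` parameter are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def select_tokens_by_attention(
    cls_attention: Sequence[Sequence[float]],
    keep_ratio: float,
) -> List[int]:
    """Return the sorted indices of the patch tokens to keep.

    cls_attention[h][i] is the attention weight that head h of a frozen
    layer assigns from the class token to patch token i (assumed layout).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_heads = len(cls_attention)
    num_tokens = len(cls_attention[0])

    # Average the scores over heads to get one importance value per token.
    scores = [
        sum(head[i] for head in cls_attention) / num_heads
        for i in range(num_tokens)
    ]

    # Keep the highest-scoring tokens; the rest are dropped before the
    # trainable blocks, so their activations never need to be stored.
    num_keep = max(1, math.ceil(keep_ratio * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    attention = [[0.1, 0.4, 0.3, 0.2], [0.3, 0.2, 0.4, 0.1]]
    print(select_tokens_by_attention(attention, keep_ratio=0.5))
```

Because the scores come from layers that are already frozen, a rule of this kind adds no trainable parameters; the kept indices are returned in their original order so that positional information stays aligned.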

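After examining the limitations of existing on-device training methods, this paper proposes block selective reprogramming (BSR), which fine-tunes only part of a pre-trained model while keeping the remaining layers frozen and selectively drops tokens based on self-attention scores. This reduces training memory and compute cost while maintaining similar accuracy, and the approach also applies to multitask learning scenarios.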

Block Selective Reprogramming for On-device Training of Vision Transformers

State-space models (SSMs) have recently demonstrated performance competitive
with transformers on large-scale language modeling benchmarks while achieving
linear time and memory complexity as a function of sequence length. Mamba, a
recently released SSM, shows impressive performance in both language
modeling and long sequence processing tasks. Simultaneously, mixture-of-expert
(MoE) models have shown remarkable performance while significantly reducing the
compute and latency costs of inference at the expense of a larger memory
footprint. In this paper, we present BlackMamba, a novel architecture that
combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate
that BlackMamba performs competitively against both Mamba and transformer
baselines, and outperforms them in inference and training FLOPs. We fully train and
open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a
custom dataset. We show that BlackMamba inherits the benefits of both SSM and
MoE architectures, combining linear-complexity generation from SSM with cheap
and fast inference from MoE. We release all weights, checkpoints, and
inference code as open source. Inference code at:
this https URL
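
To make the "cheap and fast inference from MoE" point concrete, here is a minimal sketch of generic top-k expert routing. It is not BlackMamba's router: the abstract gives no routing details, so the function `moe_forward`, the linear router, the softmax gating, and the renormalization over selected experts are all assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 1,
) -> Tuple[Vector, List[int]]:
    """Route one token through its top_k experts (hypothetical helper).

    router_weights has one row per expert; the router is a plain linear map.
    Returns the mixed output and the indices of the experts that ran.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the gate values over the selected experts only.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # only top_k experts are evaluated per token
        gate = probs[i] / norm
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


if __name__ == "__main__":
    experts = [lambda v: [2 * a for a in v], lambda v: [-a for a in v]]
    router = [[1.0, 0.0], [0.0, 1.0]]
    print(moe_forward([3.0, 1.0], router, experts, top_k=1))
```

All expert parameters must stay resident in memory, but each token pays the compute cost of only `top_k` experts, which is the trade-off the abstract describes.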

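BlackMamba, a novel architecture that combines the Mamba SSM with MoE, performs well in both training and inference FLOPs, pairing the linear-complexity generation of SSMs with the fast, efficient inference of MoE.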

BlackMamba: Mixture of Experts for State-Space Models

One defining characteristic of Mixture-of-Expert (MoE) models is their
capacity for conducting sparse computation via expert routing, leading to
remarkable scalability. However, backpropagation, the cornerstone of deep
learning, requires dense computation, thereby posing challenges in MoE
gradient computations. Here, we introduce SparseMixer, a scalable gradient
estimator that bridges the gap between backpropagation and sparse expert
routing. Unlike typical MoE training, which strategically neglects certain
gradient terms for the sake of sparse computation and scalability, SparseMixer
provides scalable gradient approximations for these terms, enabling reliable
gradient estimation in MoE training. Grounded in a numerical ODE framework,
SparseMixer harnesses the mid-point method, a second-order ODE solver, to
deliver precise gradient approximations with negligible computational overhead.
Applied to Switch Transformer on both pre-training and machine translation
tasks, SparseMixer delivers considerable performance gains, accelerating
training convergence by up to 2 times.
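
For readers unfamiliar with the mid-point method mentioned above, the sketch below shows the generic second-order ODE solver next to the first-order Euler method. It illustrates only the numerical technique; it is not the SparseMixer gradient estimator, and the helper names and the test equation dy/dt = y are assumptions chosen for illustration.

```python
import math
from typing import Callable

Rhs = Callable[[float, float], float]  # f(t, y) in dy/dt = f(t, y)


def euler_step(f: Rhs, t: float, y: float, h: float) -> float:
    """First-order step: use the slope at the start of the interval."""
    return y + h * f(t, y)


def midpoint_step(f: Rhs, t: float, y: float, h: float) -> float:
    """Second-order step: use the slope at the estimated interval midpoint."""
    k1 = f(t, y)
    return y + h * f(t + h / 2, y + (h / 2) * k1)


def integrate(step, f: Rhs, t0: float, y0: float, t1: float, n: int) -> float:
    """Apply `step` n times with a uniform step size from t0 to t1."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y


if __name__ == "__main__":
    f = lambda t, y: y  # exact solution is y(t) = exp(t)
    exact = math.exp(1.0)
    for name, step in (("euler", euler_step), ("midpoint", midpoint_step)):
        approx = integrate(step, f, 0.0, 1.0, 1.0, 10)
        print(name, abs(approx - exact))
```

The mid-point step costs one extra function evaluation per step, yet its error shrinks quadratically with the step size rather than linearly, which is the general property a second-order solver offers.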

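SparseMixer bridges the gap between sparse computation and backpropagation, providing reliable gradient estimates and accelerating the training convergence of Switch Transformer.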