We study the effect of one type of imbalance often present in real-life
multilingual classification datasets: an uneven distribution of labels across
languages. We show evidence that fine-tuning a transformer-based Large Language
Model (LLM) on a dataset with this imbalance leads to worse performance, a more
pronounced separation of languages in the latent space, and the promotion of
uninformative features. We modify the traditional class weighting approach to
imbalance by calculating class weights separately for each language and show
that this helps mitigate those detrimental effects. These results raise
awareness of the negative effects of language-specific class imbalance in
multilingual fine-tuning and of the way in which the model learns to rely on the
separation of languages to perform the task.
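
A minimal sketch of per-language class weighting follows. The abstract does not give the exact formula, so this assumes the common "balanced" heuristic (weight = samples / (classes * class count)) applied within each language rather than over the whole dataset. The function name and the toy data are hypothetical.

```python
from collections import Counter


def per_language_class_weights(languages, labels):
    """Return {(language, label): weight}, computed separately per language.

    Assumption: the 'balanced' heuristic n_lang / (k_lang * n_lang_label),
    where k_lang is the number of classes observed in that language.
    """
    pair_counts = Counter(zip(languages, labels))   # samples per (language, label)
    lang_counts = Counter(languages)                # samples per language
    classes_per_lang = Counter(lang for lang, _ in pair_counts)
    return {
        (lang, label): lang_counts[lang] / (classes_per_lang[lang] * count)
        for (lang, label), count in pair_counts.items()
    }


# Toy dataset: label 1 dominates in "en", label 0 dominates in "de".
languages = ["en", "en", "en", "en", "de", "de", "de", "de"]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
weights = per_language_class_weights(languages, labels)
```

In this toy dataset the labels are balanced overall (four of each), so a global class weighting would leave every sample with the same weight, while the per-language weights up-weight the minority label inside each language.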

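We study a type of imbalance common in real-life multilingual classification datasets: an uneven distribution of labels across languages. We present evidence that fine-tuning a Transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, more pronounced differences between languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting method, computing class weights separately for each language to mitigate these adverse effects. These results raise awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and of the model's reliance on language separation when performing the task.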

Understanding the impact of language-specific class imbalance in multilingual fine-tuning

Understanding the effects of language-specific class imbalance in multilingual fine-tuning

The self-attention mechanism sets transformer-based large language models
(LLMs) apart from convolutional and recurrent neural networks. Despite the
performance improvement, achieving real-time LLM inference on silicon is
challenging due to the extensively used Softmax in self-attention. Apart from
the non-linearity, the low arithmetic intensity greatly reduces the processing
parallelism, which becomes the bottleneck especially when dealing with a longer
context. To address this challenge, we propose Constant Softmax (ConSmax), a
software-hardware co-design as an efficient Softmax alternative. ConSmax
employs differentiable normalization parameters to remove the maximum searching
and denominator summation in Softmax. It allows for massive parallelization
while performing the critical tasks of Softmax. In addition, a scalable ConSmax
hardware utilizing a bitwidth-split look-up table (LUT) can produce lossless
non-linear operation and support mixed-precision computing. It further
facilitates efficient LLM inference. Experimental results show that ConSmax
achieves a minuscule power consumption of 0.43 mW and area of 0.001 mm² at
1-GHz working frequency and 22-nm CMOS technology. Compared to state-of-the-art
Softmax hardware, ConSmax results in 14.5x energy and 14.0x area savings with a
comparable accuracy on a GPT-2 model and the WikiText103 dataset.
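
The idea of replacing the maximum search and denominator summation with normalization parameters can be sketched as below. The abstract does not state the formula, so the form exp(s - beta) / gamma, with beta and gamma standing in for the learnable parameters, is an assumption made for illustration; the function names are hypothetical, and training of the parameters is not shown.

```python
import math


def softmax(scores):
    """Standard Softmax: needs a max search and a sum over the whole row."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def consmax_sketch(scores, beta, gamma):
    """Assumed ConSmax-style form: exp(s - beta) / gamma.

    beta and gamma are fixed at inference time, so each element is computed
    independently of the others (no row-wide max or sum).
    """
    return [math.exp(s - beta) / gamma for s in scores]


scores = [1.0, 2.0, 3.0]
# If the parameters happened to equal this row's max and exp-sum,
# the sketch reproduces Softmax exactly.
beta = max(scores)
gamma = sum(math.exp(s - beta) for s in scores)
reference = softmax(scores)
approx = consmax_sketch(scores, beta, gamma)
```

With other parameter values the outputs no longer sum to one for a given row, which is why the parameters are described as differentiable: under this sketch's assumption they would be fitted during training rather than computed per row.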

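This work proposes ConSmax, an efficient alternative to the Softmax used in self-attention. It uses scalable hardware and differentiable parameters to enable massively parallel computation for real-time inference of Transformer-based large language models, and it achieves better energy and area efficiency than existing designs.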

ConSmax: A Hardware-Friendly Softmax Alternative with Learnable Parameters

ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters

Computation in a typical Transformer-based large language model (LLM) can be
characterized by batch size, hidden dimension, number of layers, and sequence
length. Until now, systems work on accelerating LLM training has focused on
the first three dimensions: data parallelism for batch size, tensor parallelism
for hidden size, and pipeline parallelism for model depth or layers. These
widely studied forms of parallelism are not targeted or optimized for
long-sequence Transformer models. Given practical application needs for
long-sequence LLMs, renewed attention is being drawn to sequence parallelism.
However, existing works in sequence parallelism are constrained by
memory-communication inefficiency, limiting their scalability to long sequence
large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable
and effective methodology for enabling highly efficient and scalable LLM
training with extremely long sequence length. DeepSpeed-Ulysses at its core
partitions input data along the sequence dimension and employs an efficient
all-to-all collective communication for attention computation. Theoretical
communication analysis shows that whereas other methods incur communication
overhead as sequence length increases, DeepSpeed-Ulysses maintains constant
communication volume when sequence length and compute devices are increased
proportionally. Furthermore, experimental evaluations show that
DeepSpeed-Ulysses trains 2.5X faster with a 4X longer sequence length than the
existing SOTA baseline.
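
The constant-communication claim can be illustrated with a simple cost model. This is a sketch under stated assumptions, not the paper's analysis: it assumes the per-device all-to-all volume scales as tensors * N * h / P (sequence length N, hidden size h, P devices), and that a baseline's volume scales as tensors * N * h regardless of P. The factor of 4 tensors and both function names are hypothetical.

```python
def all_to_all_volume_per_device(seq_len, hidden, devices, tensors=4):
    """Assumed model: each device exchanges a 1/devices share of the tensors."""
    return tensors * seq_len * hidden / devices


def baseline_volume_per_device(seq_len, hidden, tensors=4):
    """Assumed baseline model: volume grows with sequence length, independent of devices."""
    return tensors * seq_len * hidden


hidden = 4096
small = all_to_all_volume_per_device(seq_len=8192, hidden=hidden, devices=8)
# Double both the sequence length and the number of devices.
large = all_to_all_volume_per_device(seq_len=16384, hidden=hidden, devices=16)

baseline_small = baseline_volume_per_device(seq_len=8192, hidden=hidden)
baseline_large = baseline_volume_per_device(seq_len=16384, hidden=hidden)
```

Under these assumptions `small` and `large` are equal, while the baseline volume doubles when the sequence length doubles.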

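DeepSpeed-Ulysses is a novel, portable, and effective method for efficient and scalable training of long-sequence large language models. It partitions input data along the sequence dimension and uses efficient all-to-all collective communication for attention computation. Experimental evaluations show that DeepSpeed-Ulysses trains 2.5X faster at a 4X longer sequence length than existing methods.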