Recently developed large language models (LLMs) such as ChatGPT, Claude, and
Llama have demonstrated impressive abilities, and even surpass human-level
performance in several tasks. Despite their success, the resource-intensive
demands of these models, requiring significant computational power for both
training and inference, limit their deployment to high-performance servers.
Additionally, the extensive calculation requirements of the models often lead
to increased response latency. With the increasing need for LLMs to
operate efficiently on CPUs, research on lightweight models that are
optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a
lightweight LLM trained on 550 billion tokens of Chinese and English
text. We employ recent training techniques, including RoPE,
Grouped-Query Attention, and FlashAttention-2, to accelerate training while
maintaining model performance. Additionally, we fine-tune the model using 10
million samples of instruction data to enhance alignment. GEB-1.3B exhibits
outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU,
outperforming comparative models such as MindLLM-1.3B and TinyLLaMA-1.1B.
Notably, the FP32 version of GEB-1.3B achieves commendable inference times on
CPUs, with ongoing efforts to further enhance speed through advanced
quantization techniques. The release of GEB-1.3B as an open-source model marks
a significant contribution to the development of lightweight LLMs, promising to
foster further research and innovation in the field.
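
To make the link between numeric precision and CPU deployment concrete, the following is a rough back-of-the-envelope sketch of weight storage for a 1.3-billion-parameter model at several precisions. It is not taken from the GEB-1.3B release: the helper name `weight_memory_gb` and the list of precisions are assumptions for illustration, and the estimate counts weights only, ignoring activations and the key-value cache.

```python
# Hypothetical helper: approximate weight storage at different precisions.
# Bytes per parameter for each format (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 1.3e9 parameters, matching the model size named in the abstract.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(1.3e9, dtype):.2f} GB")
```

Under these assumptions the FP32 weights occupy roughly 5.2 GB, which is one reason quantization is a natural next step for CPU inference.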

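Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, even surpassing human-level performance in several tasks. However, these models require significant computational power for both training and inference, which limits their deployment to high-performance servers. Given the growing need to run LLMs efficiently on CPUs, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens of Chinese and English text. We employ several new training techniques, including RoPE, Grouped-Query Attention, and FlashAttention-2, to accelerate training while maintaining model performance. In addition, we fine-tune the model on 10 million instruction samples to improve alignment. GEB-1.3B performs well on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, and work is under way to further improve speed through advanced quantization techniques. The open-source release of GEB-1.3B is a significant contribution to the development of lightweight LLMs and is expected to foster further research and innovation in the field.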

GEB-1.3B: Open Lightweight Large Language Model

While GPU clusters are the de facto choice for training large deep neural
network (DNN) models today, several reasons including ease of workflow,
security and cost have led to efforts investigating whether CPUs may be viable
for inference in routine use in many sectors of the industry. But the imbalance
between the compute capabilities of GPUs and CPUs is huge. Motivated by these
considerations, we study a module which is a workhorse within modern DNN
architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent
to which it can be made compute- (or FLOP-) lite. Specifically, we propose an
alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by
the recent studies of using Locality Sensitive Hashing (LSH) to approximate
FFNs. Our formulation recasts most essential operations as memory look-ups,
leveraging the trade-off between the two resources on any platform, compute and
memory, since CPUs offer memory in abundance. For RoBERTa language model
pretraining, our formulation achieves performance similar to GEMM
based FFNs, while dramatically reducing the required FLOPs. Our development is
complemented with a detailed hardware profiling of strategies that will
maximize efficiency -- not just on contemporary hardware but on products that
will be offered in the near/medium term future. Code is available at
https://github.com/mlpen/LookupFFN.
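
The idea of trading compute for memory look-ups can be illustrated with a generic LSH-style layer. This is only a minimal sketch under stated assumptions, not the LookupFFN formulation from the paper or its repository: it uses fixed random hyperplanes as hash functions and randomly initialized tables, and all names (`make_lookup_layer`, `bucket_index`, `lookup_forward`) are hypothetical.

```python
import random


def make_lookup_layer(d_in, d_out, num_tables, bits, seed=0):
    """Build random hyperplanes and lookup tables (illustrative, untrained)."""
    rng = random.Random(seed)
    # planes[t][b] is one hyperplane of length d_in for table t, hash bit b.
    planes = [[[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(bits)]
              for _ in range(num_tables)]
    # tables[t][bucket] is an output vector of length d_out.
    tables = [[[rng.gauss(0, 0.1) for _ in range(d_out)]
               for _ in range(2 ** bits)]
              for _ in range(num_tables)]
    return planes, tables


def bucket_index(x, table_planes):
    """Hash x to a bucket: one sign bit per hyperplane."""
    idx = 0
    for plane in table_planes:
        dot = sum(p * v for p, v in zip(plane, x))
        idx = (idx << 1) | (1 if dot >= 0 else 0)
    return idx


def lookup_forward(x, planes, tables):
    """Sum one looked-up row per table instead of a dense matrix product."""
    d_out = len(tables[0][0])
    out = [0.0] * d_out
    for table_planes, table in zip(planes, tables):
        row = table[bucket_index(x, table_planes)]
        for j in range(d_out):
            out[j] += row[j]
    return out


if __name__ == "__main__":
    planes, tables = make_lookup_layer(d_in=8, d_out=4, num_tables=3, bits=4)
    x = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]
    print(lookup_forward(x, planes, tables))
```

In this sketch the only multiply-adds are the `num_tables * bits` hash projections; the output itself is assembled from table rows, so the cost shifts from arithmetic to memory capacity and access.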

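By studying GEMM-based feed-forward network (FFN) modules, we propose an alternative formulation (called LookupFFN) that recasts most essential operations as memory look-ups to reduce the required FLOPs, achieving similar performance in RoBERTa language model pretraining.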

LookupFFN: Making Transformers Compute-lite for CPU inference

In this paper, we address the problem of reducing the memory footprint of
convolutional network architectures. We introduce a vector quantization method
that aims at preserving the quality of the reconstruction of the network
outputs rather than its weights. The principle of our approach is that it
minimizes the loss reconstruction error for in-domain inputs. Our method only
requires a set of unlabelled data at quantization time and allows for efficient
inference on CPU by using byte-aligned codebooks to store the compressed
weights. We validate our approach by quantizing a high performing ResNet-50
model to a memory size of 5MB (20x compression factor) while preserving a top-1
accuracy of 76.1% on ImageNet object classification and by compressing a Mask
R-CNN by a factor of 26x.
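
The storage side of codebook quantization can be sketched as follows. Note the swapped-in technique: this example runs plain k-means on the weights themselves, whereas the abstract describes minimizing the reconstruction error of the network outputs, which would require activations from unlabelled data. The function names, block size, and FP32 codebook are assumptions for illustration; the one detail taken from the abstract is that each code fits in a byte, so at most 256 centroids are used.

```python
import random


def nearest(v, centroids):
    """Index of the centroid closest to v in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))


def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(v, centroids) for v in vectors]


def quantize(weights, block, k):
    """Split weights into blocks and store one byte-sized code per block."""
    assert k <= 256 and len(weights) % block == 0
    vectors = [weights[i:i + block] for i in range(0, len(weights), block)]
    centroids, assign = kmeans(vectors, k)
    return centroids, bytes(assign)


def reconstruct(centroids, codes):
    """Rebuild approximate weights by looking up each code's centroid."""
    return [w for code in codes for w in centroids[code]]


def compression_ratio(num_weights, block, k):
    """FP32 original size over (1-byte codes + FP32 codebook) size."""
    return (num_weights * 4) / (num_weights // block + k * block * 4)


if __name__ == "__main__":
    rng = random.Random(1)
    weights = [rng.gauss(0, 1) for _ in range(1024)]
    centroids, codes = quantize(weights, block=4, k=16)
    approx = reconstruct(centroids, codes)
    mse = sum((a - b) ** 2 for a, b in zip(weights, approx)) / len(weights)
    print(f"codes: {len(codes)} bytes, mse: {mse:.3f}")
```

Larger blocks raise the compression ratio because one byte then encodes more weights, at the cost of higher reconstruction error for a fixed number of centroids.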

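This paper proposes a vector quantization method that reduces the memory footprint of convolutional network architectures, delivering high-accuracy image recognition with a small memory footprint.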