To democratize large language models (LLMs) across most natural languages, it is
imperative to make these models capable of understanding and generating texts
in many languages, in particular low-resource ones. While recent multilingual
LLMs demonstrate remarkable performance in such capabilities, these LLMs still
support a limited number of human languages due to the lack of training data
for low-resource languages. Moreover, these LLMs are not yet aligned with human
preferences for downstream tasks, even though such alignment has been crucial to
the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively
xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100
languages. To do so, we construct two datasets: a multilingual instruction
dataset including 100 languages, which represents the largest language coverage
to date, and a cross-lingual human feedback dataset encompassing 30 languages.
We perform multilingual instruction tuning on the constructed instruction data
and further align the LLMs with human feedback using the DPO algorithm on our
cross-lingual human feedback dataset. We evaluate the multilingual
understanding and generation capabilities of xLLMs-100 on five multilingual
benchmarks. Experimental results show that xLLMs-100 consistently outperforms
its peers across the benchmarks by considerable margins, defining a new
state-of-the-art multilingual LLM that supports 100 languages.
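
As a rough illustration of the preference-alignment step, the sketch below computes the standard DPO loss for a single pair of preferred and dispreferred responses, given their sequence log-probabilities under the policy being trained and under a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response.
    `beta` scales the implicit reward; 0.1 is an assumed value.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss drops below log(2) once the policy favours the preferred
# response more strongly than the reference model does.
loss_when_preferred = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_when_reversed = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

In an actual training run the log-probabilities would come from the models being tuned and the loss would be averaged over a batch; this sketch only shows the per-pair objective.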

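By constructing two datasets, this work extends the multilingual capabilities of LLaMA and BLOOM to 100 languages and aligns the LLMs with human feedback using the DPO algorithm, establishing a new state of the art for multilingual LLMs that support 100 languages.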

LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback

Although the capabilities of large language models (LLMs) ideally scale up
with increasing data and compute, they are inevitably constrained by limited
resources in reality. Given a moderately trained LLM (e.g., one trained
to align with human preference), can we further exploit its potential
and cheaply acquire a stronger model? In this paper, we propose a simple method
called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a
medium-aligned model can be interpolated between a less-aligned (weaker) model,
e.g., the initial SFT model, and a better-aligned (stronger) one, so that the
stronger model can be obtained directly by extrapolating from the weights of the
two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show
that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to
reach and even surpass the fully-trained one, without any additional training.
Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and
exhibits decent scalability across model sizes from 7B to 70B. Our work
demonstrates the efficacy of model extrapolation in exploiting LLMs'
capabilities, suggesting a promising direction that deserves future
exploration.
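
The interpolation assumption can be made concrete with a small sketch. If the medium-aligned weights equal (1 - λ) · weak + λ · strong for some λ in (0, 1), then solving for the stronger model gives strong = medium + α · (medium - weak) with α = 1/λ - 1. The code below applies that update parameter by parameter to toy checkpoints stored as plain dictionaries. The function name `extrapolate_weights`, the parameter names, and the choice `alpha=0.5` are hypothetical; since λ is not known in practice, α is assumed here to be a hyperparameter chosen by the user.

```python
def extrapolate_weights(weak, medium, alpha):
    """Return medium + alpha * (medium - weak) for every parameter.

    `weak` and `medium` map parameter names to flat lists of floats
    (a stand-in for real checkpoints); `alpha` is an assumed
    hyperparameter controlling how far to extrapolate.
    """
    if weak.keys() != medium.keys():
        raise ValueError("checkpoints must share the same parameter names")
    result = {}
    for name in medium:
        if len(weak[name]) != len(medium[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        result[name] = [
            m + alpha * (m - w) for w, m in zip(weak[name], medium[name])
        ]
    return result


# Toy checkpoints: "sft" plays the weaker model, "aligned" the medium one.
sft = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
aligned = {"layer.weight": [0.2, 0.8], "layer.bias": [0.5]}

# Move further along the direction from the SFT model to the aligned model.
extrapolated = extrapolate_weights(sft, aligned, alpha=0.5)
```

No gradient step is involved: the update only needs the two existing checkpoints, which matches the abstract's claim that no additional training is required.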

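With ExPO, we show that models trained with less preference data can be pushed to match or even surpass the fully trained model, with decent scalability across model sizes. This suggests that model extrapolation is a promising way to exploit LLMs' capabilities and deserves future exploration.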