Preference optimization techniques have become a standard final stage for training state-of-the-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to date has focused on first-class citizen languages like English and Chinese. This not only captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state of the art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a win rate of 69.5% or higher against widely used models such as Gemma-1.1-7B-it, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world's population.

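Through an exhaustive study of 23 languages, this work achieves a new state of the art in aligning multilingual large language models. By introducing a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage, we obtain a preference-trained model that outperforms the current state-of-the-art multilingual LLM in its parameter class and achieves a win rate of 69.5% or higher against widely used models. This expands the frontier of alignment techniques to 23 languages covering half of the world's population.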

RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs