Aligning language models (LMs) based on human-annotated preference data is a
crucial step in obtaining practical and performant LM-based systems. However,
multilingual human preference data are difficult to obtain at scale, making it
challenging to extend this framework to diverse languages. In this work, we
evaluate a simple approach for zero-shot cross-lingual alignment, where a
reward model is trained on preference data in one source language and directly
applied to other target languages. On summarization and open-ended dialog
generation, we show that this method is consistently successful under
comprehensive evaluation settings, including human evaluation: cross-lingually
aligned models are preferred by humans over unaligned models on more than 70% of
evaluation instances in the best case. Moreover, we find that a different-language reward model
sometimes yields better aligned models than a same-language reward model. We
also identify best practices when there is no language-specific data for even
supervised finetuning, another component in alignment.

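This work investigates a simple method for zero-shot cross-lingual alignment, in which a reward model is trained on preference data in one source language and applied to other target languages. On summarization and open-ended dialog generation, comprehensive evaluation, including human evaluation, shows that the method is consistently successful: cross-lingually aligned models are preferred over unaligned models on more than 70% of evaluation instances. We also find that a different-language reward model sometimes yields better alignment than a same-language reward model, and we identify best practices for when no language-specific data is available for supervised finetuning, another component of alignment.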