Though reasoning abilities are considered language-agnostic, existing LLMs
exhibit inconsistent reasoning abilities across different languages, e.g.,
reasoning in a pivot language is superior to that in other languages due to the
imbalance of multilingual training data. To enhance reasoning abilities in
non-pivot languages, we propose an alignment-as-preference optimization
framework. Specifically, we adopt an open-source translation model to estimate
the consistency between answers in non-pivot and pivot languages. We further
adopt the answer consistency as the preference for DPO or PPO, thus optimizing
the weaker reasoning in non-pivot languages. Experiments show that our method significantly improves
the model's multilingual reasoning, with better reasoning consistency across
languages. Our framework achieves a 13.7% accuracy improvement on the out-of-domain
dataset MSVAMP while preserving competitive performance on MGSM. Moreover,
we find that iterative DPO helps further align and improve
the model's multilingual mathematical reasoning ability, pushing the
improvement to 16.7%.
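
As a rough illustration of how answer consistency could serve as a preference signal, the sketch below ranks sampled non-pivot answers by a consistency score against the pivot-language answer and turns the best and worst into a DPO training pair. It is a minimal sketch under stated assumptions, not the implementation used in this work: `consistency_score` is a hypothetical placeholder for whatever score the translation model provides, the best-versus-worst pairing rule and the `min_gap` threshold are assumptions, and `dpo_loss` is the standard per-example DPO objective computed from sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # non-pivot answer judged most consistent with the pivot answer
    rejected: str  # non-pivot answer judged least consistent with the pivot answer


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    pivot_answer: str,
    consistency_score: Callable[[str, str], float],  # hypothetical scorer (assumption)
    min_gap: float = 0.0,                            # assumed threshold, not from the paper
) -> Optional[PreferencePair]:
    """Pick the most and least consistent candidates as (chosen, rejected)."""
    if len(candidates) < 2:
        return None
    scored = sorted(
        ((consistency_score(c, pivot_answer), c) for c in candidates),
        key=lambda item: item[0],
    )
    (low, worst), (high, best) = scored[0], scored[-1]
    if high - low <= min_gap:
        return None  # no usable preference signal between the candidates
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an iterative setup, the pairs would be rebuilt from answers sampled from the most recently trained policy before each new round of DPO; how many rounds to run and how candidates are sampled are left open here.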

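By adopting an alignment-as-preference optimization framework, we improve reasoning abilities in non-pivot languages, improve reasoning consistency, and further optimize the model's multilingual mathematical reasoning ability through iterative DPO.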