Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource translation still lags significantly behind Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with three LLMs across two low-resource language groups (Indigenous American and North-East Indian), reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.

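This paper examines what is required to adapt Large Language Models (LLMs) to low-resource translation, focusing on how the importance of parallel data and diversity in supervised fine-tuning affect performance. It finds that parallel data is critical for low-resource LLM-MT during both pretraining and fine-tuning, whereas diversity tends to cause interference rather than transfer. These findings generalize and are valuable for improving multilingual LLM-MT models for low-resource languages.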

Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation