In this paper, we propose a two-phase training approach in which pre-trained
large language models are continually pre-trained on parallel data and then
fine-tuned with supervision on a small amount of high-quality parallel data. To
investigate the effectiveness of the proposed approach, we conduct continual
pre-training with a 3.8B-parameter model and parallel data in eight
different formats. We evaluate these methods on thirteen test sets for
Japanese-to-English and English-to-Japanese translation. The results
demonstrate that when utilizing parallel data in continual pre-training, it is
essential to alternate between source and target sentences. Moreover,
translation accuracy improves only for translation directions in which the
order of source and target sentences in the continual pre-training data
matches the order used at inference. In addition, we demonstrate that the
LLM-based translation model is more robust in translating spoken language and
achieves higher accuracy with less training data than supervised
encoder-decoder models. Finally, we show that the highest accuracy is achieved when
the data for continual pre-training consists of interleaved source and target
sentences and when tags are added to the source sentences.
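
As a rough illustration of the best-performing data format described above (interleaved source and target sentences, with a tag on each source sentence), the following minimal sketch builds one training text from sentence pairs. The function name, the `<src>` tag string, and the newline separator are assumptions made for illustration; the abstract does not specify the actual format.

```python
from typing import Iterable, List, Tuple


def build_interleaved_text(
    pairs: Iterable[Tuple[str, str]],
    source_tag: str = "<src>",  # hypothetical tag; the real tag format is not given
) -> str:
    """Alternate source and target sentences, tagging each source sentence."""
    lines: List[str] = []
    for source, target in pairs:
        lines.append(f"{source_tag} {source}")  # tagged source sentence comes first
        lines.append(target)                    # its translation follows immediately
    return "\n".join(lines)


# Japanese-to-English order: each Japanese source is followed by its English target.
pairs = [
    ("こんにちは。", "Hello."),
    ("ありがとう。", "Thank you."),
]
print(build_interleaved_text(pairs))
```

Because the abstract reports gains only when the source/target order in the pre-training data matches the order at inference, data for English-to-Japanese translation would presumably be built by passing the pairs with the English sentence first.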

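We show the effectiveness of a two-phase training approach in which a large language model is continually pre-trained on parallel data and then fine-tuned with supervision on a small amount of high-quality parallel data. Our study shows that, when continually pre-training on parallel data, alternating between source and target sentences is essential. We also show that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data than supervised encoder-decoder models. The highest accuracy is achieved when the continual pre-training data consists of interleaved source and target sentences and tags are added to the source sentences.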

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Protolanguage reconstruction is central to historical linguistics. The
comparative method, one of the most influential theoretical and methodological
frameworks in the history of the language sciences, allows linguists to infer
protoforms (reconstructed ancestral words) from their reflexes (related modern
words) based on the assumption of regular sound change. Not surprisingly,
numerous computational linguists have attempted to operationalize comparative
reconstruction through various computational models, the most successful of
which have been supervised encoder-decoder models, which treat the problem of
predicting protoforms given sets of reflexes as a sequence-to-sequence problem.
We argue that this framework ignores one of the most important aspects of the
comparative method: not only should protoforms be inferable from cognate sets
(sets of related reflexes) but the reflexes should also be inferable from the
protoforms. Leveraging another line of research -- reflex prediction -- we
propose a system in which candidate protoforms from a reconstruction model are
reranked by a reflex prediction model. We show that this more complete
implementation of the comparative method allows us to surpass state-of-the-art
protoform reconstruction methods on three of four Chinese and Romance datasets.
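
The reranking idea above can be sketched as follows, under stated assumptions: each candidate protoform carries a score from the reconstruction model, and a reflex prediction model scores how well each attested reflex can be derived from that candidate. The additive combination, the `weight` parameter, and the toy character-mismatch scorer are all hypothetical stand-ins, not the authors' models or scoring rule.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]  # (protoform, reconstruction log-score)


def rerank(
    candidates: List[Candidate],
    reflexes: Dict[str, str],  # language -> attested reflex
    reflex_log_prob: Callable[[str, str, str], float],
    weight: float = 1.0,  # hypothetical interpolation weight
) -> List[str]:
    """Order candidates by reconstruction score plus weighted reflex-prediction score."""
    scored = []
    for protoform, recon_score in candidates:
        # How well can every attested reflex be predicted from this candidate?
        reflex_score = sum(
            reflex_log_prob(protoform, language, reflex)
            for language, reflex in reflexes.items()
        )
        scored.append((recon_score + weight * reflex_score, protoform))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [protoform for _, protoform in scored]


def toy_reflex_log_prob(protoform: str, language: str, reflex: str) -> float:
    """Toy stand-in for a trained reflex predictor: penalize character mismatches."""
    mismatches = sum(a != b for a, b in zip(protoform, reflex))
    return -float(mismatches + abs(len(protoform) - len(reflex)))


candidates = [("mater", -0.9), ("pater", -1.0)]  # reconstruction model prefers "mater"
reflexes = {"Spanish": "padre", "Italian": "padre"}
print(rerank(candidates, reflexes, toy_reflex_log_prob))
```

In this toy case the reconstruction scores alone rank "mater" first, but "pater" predicts the attested reflexes better and moves to the top after reranking.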

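By using a reflex prediction model to rerank the candidate protoforms produced by a reconstruction model, our approach surpasses state-of-the-art protoform reconstruction methods on three of four Chinese and Romance datasets.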