This paper explores cost-efficient methods to adapt pretrained Large Language
Models (LLMs) to new lower-resource languages, with a specific focus on
Estonian. Leveraging the Llama 2 model, we investigate the impact of combining
cross-lingual instruction-tuning with additional monolingual pretraining. Our
results demonstrate that even a relatively small amount of additional
monolingual pretraining followed by cross-lingual instruction-tuning
significantly enhances results on Estonian. Furthermore, we showcase
cross-lingual knowledge transfer from high-quality English instructions to
Estonian, resulting in improvements in commonsense reasoning and multi-turn
conversation capabilities. Our best model, named \textsc{Llammas}, represents
the first open-source instruction-following LLM for Estonian. Additionally, we
publish Alpaca-est, the first general task instruction dataset for Estonian.
These contributions mark initial progress toward developing open-source LLMs
for Estonian.

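This study explores cost-efficient methods for adapting pretrained Large Language Models (LLMs) to new lower-resource languages, with a particular focus on Estonian. Using the Llama 2 model, we investigate the effect of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results show that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly improves results on Estonian. We also demonstrate cross-lingual knowledge transfer from high-quality English instructions to Estonian, which improves commonsense reasoning and multi-turn conversation capabilities. Our best model, \textsc{Llammas}, is the first open-source instruction-following LLM for Estonian. In addition, we release Alpaca-est, the first general task instruction dataset for Estonian. These contributions mark initial progress toward open-source LLMs for Estonian.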

Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer

As the capabilities of language models continue to advance, it is conceivable
that a "one-size-fits-all" model will remain the main paradigm. For instance,
given the vast number of languages worldwide, many of which are low-resource,
the prevalent practice is to pretrain a single model on multiple languages. In
this paper, we add to the growing body of evidence that challenges this
practice, demonstrating that monolingual pretraining on the target language
significantly improves models already extensively trained on diverse corpora.
More specifically, we further pretrain GPT-J and LLaMA models on Portuguese
texts using 3% or less of their original pretraining budget. Few-shot
evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models
outperform English-centric and multilingual counterparts by a significant
margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By
evaluating on datasets originally conceived in the target language as well as
translated ones, we study the contributions of language-specific pretraining in
terms of 1) capturing linguistic nuances and structures inherent to the target
language, and 2) enriching the model's knowledge about a domain or culture. Our
results indicate that the majority of the benefits stem from the
domain-specific knowledge acquired through monolingual pretraining.

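In this paper, we show that monolingual pretraining on the target language significantly improves models that were already extensively trained on diverse corpora, and that the resulting models outperform English-centric and multilingual counterparts on 14 Portuguese datasets. Our results indicate that most of the benefits of monolingual pretraining come from domain-specific knowledge.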

Sabiá: Portuguese Large Language Models

A major open problem in neural machine translation (NMT) is the translation
of idiomatic expressions, such as "under the weather". The meaning of these
expressions cannot be composed from the meanings of their constituent words, and NMT
models tend to translate them literally (i.e., word-by-word), which leads to
confusing and nonsensical translations. Research on idioms in NMT is limited
and obstructed by the absence of automatic methods for quantifying these
errors. In this work, first, we propose a novel metric for automatically
measuring the frequency of literal translation errors without human
involvement. Equipped with this metric, we present controlled translation
experiments with models trained in different conditions (with/without the
test-set idioms) and across a wide range of (global and targeted) metrics and
test sets. We explore the role of monolingual pretraining and find that it
yields substantial targeted improvements, even without observing any
translation examples of the test-set idioms. In our analysis, we probe the role
of idiom context. We find that the randomly initialized models are more local
or "myopic" as they are relatively unaffected by variations of the idiom
context, unlike the pretrained ones.

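This paper examines the problem of translating idiomatic expressions in neural machine translation and proposes a new metric for automatically quantifying idiom translation errors. Through experiments with models trained under different conditions and evaluated on different test sets, it explores how monolingual pretraining and idiom context affect translation quality.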