Despite rapid progress in large language models (LLMs), their performance on
a vast majority of languages remains unsatisfactory. In this paper, we study
building language-specific LLMs by adapting monolingual and multilingual LLMs.
We conduct systematic experiments on how design choices (base model selection,
vocabulary extension, and continued fine-tuning) impact the adapted LLM, both
in terms of efficiency (how many tokens are needed to encode the same amount of
information) and end task performance. We find that (1) the initial performance
before the adaptation is not always indicative of the final performance; (2)
efficiency can easily be improved with simple vocabulary extension and continued
fine-tuning in most LLMs we study; and (3) the optimal adaptation method is
highly language-dependent, and the simplest approach works well across various
experimental settings. Adapting English-centric models can yield better results
than adapting multilingual models despite their worse initial performance on
low-resource languages. Together, our work lays a foundation for efficiently
building language-specific LLMs by adapting existing LLMs.

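We study building language-specific LLMs by adapting and extending existing LLMs. Through systematic experiments, we examine how design choices such as base model selection, vocabulary extension, and continued fine-tuning affect the adapted LLM's efficiency (the number of tokens needed to encode the same amount of information) and end task performance. We find that (1) initial performance before adaptation is not always indicative of final performance; (2) for most of the LLMs studied, efficiency can be improved with simple vocabulary extension and continued fine-tuning; and (3) the optimal adaptation method is highly language-dependent, and simple approaches work well across experimental settings. Adapting English-centric models can yield better results on low-resource languages than adapting multilingual models. Overall, our work lays a foundation for efficiently building language-specific LLMs by adapting existing LLMs.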

Exploring Design Choices for Building Language-Specific LLMs
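
The notion of efficiency used above (how many tokens are needed to encode the same text) can be illustrated with a toy example. The sketch below is not the paper's tokenizer or metric: `greedy_tokenize`, `tokens_per_char`, and both vocabularies are hypothetical names and data chosen for illustration, and a simple longest-match tokenizer stands in for a real subword tokenizer.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization (illustrative assumption).

    Any character not covered by a vocabulary entry becomes its own token.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokens_per_char(text, vocab):
    """Lower is better: fewer tokens are spent on the same text."""
    return len(greedy_tokenize(text, vocab)) / len(text)


# Hypothetical vocabularies: the extended one adds target-language entries.
text = "kia ora koutou"
base_vocab = {"ki", "ou"}
extended_vocab = base_vocab | {"kia", "ora", "koutou"}

print(tokens_per_char(text, base_vocab))
print(tokens_per_char(text, extended_vocab))
```

With the extended vocabulary the same string is covered by fewer tokens, which is the effect vocabulary extension aims for. Whether this carries over to end task performance is a separate question that the abstract addresses through continued fine-tuning.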

Large language models (LMs) are able to in-context learn -- perform a new
task via inference alone by conditioning on a few input-label pairs
(demonstrations) and making predictions for new inputs. However, there has been
little understanding of how the model learns and which aspects of the
demonstrations contribute to end task performance. In this paper, we show that
ground truth demonstrations are in fact not required -- randomly replacing
labels in the demonstrations barely hurts performance on a range of
classification and multi-choice tasks, consistently over 12 different models
including GPT-3. Instead, we find that other aspects of the demonstrations are
the key drivers of end task performance, including the fact that they provide a
few examples of (1) the label space, (2) the distribution of the input text,
and (3) the overall format of the sequence. Together, our analysis provides a
new way of understanding how and why in-context learning works, while opening
up new questions about how much can be learned from large language models
through inference alone.

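Our analysis shows that large language models do not need accurate demonstrations. Instead, task performance is driven by other aspects that the demonstrations provide: the label space, the distribution of the input text, and the overall format of the sequence. This sheds light on how and why in-context learning works, while raising new questions about how much can be learned from large language models through inference alone.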

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
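
The random-label setup described above can be sketched as a prompt builder. This is a minimal illustration under assumptions, not the authors' code: `build_prompt`, the `Input:`/`Label:` template, and the example data are all hypothetical.

```python
import random


def build_prompt(demonstrations, test_input, label_space,
                 randomize_labels=False, seed=0):
    """Format k demonstrations followed by the test input.

    When randomize_labels is True, each gold label is replaced by a label
    drawn uniformly from label_space. The inputs, the label space, and the
    overall format are left unchanged.
    """
    rng = random.Random(seed)
    blocks = []
    for text, gold_label in demonstrations:
        label = rng.choice(label_space) if randomize_labels else gold_label
        blocks.append(f"Input: {text}\nLabel: {label}")
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical sentiment data for illustration only.
label_space = ["positive", "negative"]
demos = [("A delightful film.", "positive"),
         ("Dull and far too long.", "negative")]

gold_prompt = build_prompt(demos, "I would watch it again.", label_space)
random_prompt = build_prompt(demos, "I would watch it again.", label_space,
                             randomize_labels=True)
print(random_prompt)
```

Comparing a model's accuracy on `gold_prompt`-style and `random_prompt`-style inputs is one way to probe how much the correctness of the labels matters, as opposed to the label space, input distribution, and format.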

ALFRED is a recently proposed benchmark that requires a model to complete
tasks in simulated house environments specified by instructions in natural
language. We hypothesize that the key to success is accurately aligning the text
modality with visual inputs. Motivated by this, we inspect how well existing
models can align these modalities using our proposed intrinsic metric, boundary
adherence score (BAS). The results show that previous models are indeed failing
to perform proper alignment. To address this issue, we introduce approaches
aimed at improving model alignment and demonstrate how improved alignment
improves end task performance.

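This paper studies task completion in the simulated house environments of ALFRED. It proposes that aligning the text with visual inputs is the key to success, uses the proposed boundary adherence score (BAS) metric to inspect how well existing models align text and vision, and introduces improved approaches that raise both model alignment and end task performance.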