Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.

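BLOOM is a 176B-parameter decoder-only Transformer language model trained on the ROOTS corpus. It achieves competitive performance on a wide variety of benchmarks, with stronger results after multitask prompted finetuning. The authors argue for opening up this kind of research and release the models and code under the Responsible AI License to support future research and applications.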

BLOOM: A Multilingual Open-Access Language Model with 176B Parameters

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Multitask prompted finetuning (MTF) has been shown to help large language
models generalize to new tasks in a zero-shot setting, but so far explorations
of MTF have focused on English data and models. We apply MTF to the pretrained
multilingual BLOOM and mT5 model families to produce finetuned variants called
BLOOMZ and mT0. We find finetuning large multilingual language models on
English tasks with English prompts allows for task generalization to
non-English languages that appear only in the pretraining corpus. Finetuning on
multilingual tasks with English prompts further improves performance on English
and non-English tasks, leading to various state-of-the-art zero-shot results. We
also investigate finetuning on multilingual tasks with prompts that have been
machine-translated from English to match the language of each dataset. We find
training on these machine-translated prompts leads to better performance on
human-written prompts in the respective languages. Surprisingly, we find models
are capable of zero-shot generalization to tasks in languages they have never
intentionally seen. We conjecture that the models are learning higher-level
capabilities that are both task- and language-agnostic. In addition, we
introduce xP3, a composite of supervised datasets in 46 languages with English
and machine-translated prompts. Our code, datasets and models are publicly
available at this https URL.
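
The two prompting setups described above (English prompts versus prompts machine-translated into the dataset language) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PromptTemplate` class, the `build_pairs` helper, the template wording, and the sample sentence are hypothetical and are not taken from xP3 or from the released code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PromptTemplate:
    """Hypothetical template that turns one labeled example into (prompt, target)."""
    language: str
    pattern: str              # placeholders: {premise}, {hypothesis}
    answers: Dict[int, str]   # dataset label -> target text

    def apply(self, example: dict) -> Tuple[str, str]:
        # str.format ignores the extra "label" key in the example.
        prompt = self.pattern.format(**example)
        return prompt, self.answers[example["label"]]


# Illustrative English prompt applied to non-English data.
ENGLISH = PromptTemplate(
    language="en",
    pattern="{premise}\nQuestion: {hypothesis} Yes, no, or maybe?",
    answers={0: "Yes", 1: "Maybe", 2: "No"},
)

# Illustrative stand-in for a machine-translated version of the same prompt.
SPANISH_MT = PromptTemplate(
    language="es",
    pattern="{premise}\nPregunta: {hypothesis} ¿Sí, no o quizás?",
    answers={0: "Sí", 1: "Quizás", 2: "No"},
)

# Made-up Spanish entailment example (label 2 = contradiction).
EXAMPLE = {
    "premise": "El gato duerme en el sofá.",
    "hypothesis": "El gato está despierto.",
    "label": 2,
}


def build_pairs(examples: List[dict],
                templates: List[PromptTemplate]) -> List[Tuple[str, str]]:
    """Apply every template to every example to form finetuning pairs."""
    return [t.apply(ex) for ex in examples for t in templates]


if __name__ == "__main__":
    for prompt, target in build_pairs([EXAMPLE], [ENGLISH, SPANISH_MT]):
        print(prompt)
        print("->", target)
        print()
```

Each (prompt, target) pair would then be used as one supervised finetuning example; mixing many datasets and templates in this way is what makes the finetuning multitask.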

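The study finds that multitask finetuning helps large multilingual models generalize to non-English tasks, and that finetuning on prompts machine-translated from English yields better performance on human-written prompts in those languages, leading to strong zero-shot results.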

Achieving Crosslingual Generalization with Multitask Finetuning

Crosslingual Generalization through Multitask Finetuning

Large pretrained Transformer language models have been shown to exhibit
zero-shot generalization, i.e. they can perform a wide variety of tasks that
they were not explicitly trained on. However, the architectures and pretraining
objectives used across state-of-the-art models differ significantly, and there
has been limited systematic comparison of these factors. In this work, we
present a large-scale evaluation of modeling choices and their impact on
zero-shot generalization. In particular, we focus on text-to-text models and
experiment with three model architectures (causal/non-causal decoder-only and
encoder-decoder), trained with two different pretraining objectives
(autoregressive and masked language modeling), and evaluated with and without
multitask prompted finetuning. We train models with over 5 billion parameters
for more than 170 billion tokens, thereby increasing the likelihood that our
conclusions will transfer to even larger scales. Our experiments show that
causal decoder-only models trained on an autoregressive language modeling
objective exhibit the strongest zero-shot generalization after purely
unsupervised pretraining. However, models with non-causal visibility on their
input trained with a masked language modeling objective followed by multitask
finetuning perform the best among our experiments. We therefore consider the
adaptation of pretrained models across architectures and objectives. We find
that pretrained non-causal decoder models can be adapted into performant
generative causal decoder models, using autoregressive language modeling as a
downstream task. Furthermore, we find that pretrained causal decoder models can
be efficiently adapted into non-causal decoder models, ultimately achieving
competitive performance after multitask finetuning. Code and checkpoints are
available at this https URL.
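
The difference between the causal and non-causal decoder-only architectures compared above comes down to the attention mask. The sketch below is a minimal illustration, assuming that "non-causal" means bidirectional attention over the input prefix and causal attention over the remaining tokens (a prefix-LM mask). The function names are hypothetical and are not taken from the released code, and the encoder-decoder case is not covered.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """mask[i][j] is True when position i may attend to position j.

    Causal decoder: each position sees itself and earlier positions only.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_lm_mask(n: int, prefix_len: int) -> Mask:
    """Non-causal decoder (assumed prefix-LM style).

    Every position may attend to the whole input prefix, so prefix tokens
    see each other bidirectionally; tokens after the prefix stay causal.
    """
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def render(mask: Mask) -> str:
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(4)))
    print("non-causal, prefix of 2:")
    print(render(prefix_lm_mask(4, prefix_len=2)))
```

With a prefix length of zero the two masks coincide. Under this assumption, adapting a model from one architecture to the other, as the abstract describes, means continuing training with the other mask and objective rather than changing the model's parameters or layout.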

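Through a large-scale comparison of models and accompanying experiments, this paper finds that some architectures and pretraining objectives give pretrained Transformer models stronger zero-shot generalization on natural language processing tasks than others, which offers guidance for choosing model architectures and objectives. The paper also studies the adaptation of pretrained models across architectures and objectives, and releases source code and checkpoints.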