We investigate large language model performance across five orders of
magnitude of compute scaling in eleven recent model architectures. We show that
average benchmark performance, aggregating over many individual tasks and
evaluations as in the commonly used BIG-Bench dataset, is reasonably predictable
as a function of training compute scale. Specifically, when extrapolating
BIG-Bench Hard performance across one order of magnitude in compute, we observe
average absolute errors of 6 percentage points (pp). By contrast, extrapolation
for individual BIG-Bench tasks across an order of magnitude in compute yields
higher average errors of 18pp. Nonetheless, individual task performance remains
significantly more predictable than chance. Overall, our work suggests compute
scaling provides a promising basis to forecast AI capabilities in diverse
benchmarks, though predicting performance in specific tasks poses challenges.
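
The extrapolation error described above can be illustrated with a minimal sketch. It assumes a logit-linear (sigmoid) relationship between accuracy and log10 of training compute; the abstract does not state the functional form the authors fit, so this choice, the function names, and the synthetic numbers are all illustrative assumptions rather than the paper's method or data.

```python
import math


def logit(p):
    """Map an accuracy in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_linear(log_compute, accuracy):
    """Ordinary least squares of logit(accuracy) on log10(compute).

    Assumed functional form, chosen only for illustration.
    """
    n = len(log_compute)
    ys = [logit(a) for a in accuracy]
    mean_x = sum(log_compute) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in log_compute)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_compute, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, log_compute):
    """Predicted accuracy at a given log10(compute)."""
    return sigmoid(slope * log_compute + intercept)


def mean_abs_error_pp(predicted, observed):
    """Mean absolute error in percentage points, for accuracies in [0, 1]."""
    total = sum(abs(p - o) for p, o in zip(predicted, observed))
    return 100.0 * total / len(observed)


if __name__ == "__main__":
    # Synthetic points (not real benchmark results): fit on three compute
    # scales, then extrapolate one order of magnitude further.
    train_x = [21.0, 22.0, 23.0]
    train_acc = [sigmoid(0.8 * (x - 23.0)) for x in train_x]
    slope, intercept = fit_logit_linear(train_x, train_acc)
    held_out_x = 24.0
    held_out_acc = sigmoid(0.8 * (held_out_x - 23.0))
    error = mean_abs_error_pp([predict(slope, intercept, held_out_x)], [held_out_acc])
    print(f"extrapolation error: {error:.3f} pp")
```

Because the synthetic points lie exactly on a sigmoid, the extrapolation error here is essentially zero; on real benchmark data the same metric would give figures comparable in kind to the 6pp and 18pp reported above.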

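Studying large language model performance across five orders of magnitude of compute in 11 recent model architectures, we find that average benchmark performance is fairly predictable, whereas performance on specific tasks is harder to predict; compute scaling therefore offers a promising basis for forecasting AI capabilities across diverse benchmarks.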

How predictable is language model benchmark performance?

We investigate the predictability of large language model (LLM) capabilities:
given records of past experiments using different model families, numbers of
parameters, tasks, and numbers of in-context examples, can we accurately
predict LLM performance on new experiment configurations? Answering this
question has practical implications for LLM users (e.g., deciding which models
to try), developers (e.g., prioritizing evaluation on representative tasks),
and the research community (e.g., identifying hard-to-predict capabilities that
warrant further investigation).
We study the performance prediction problem on experiment records from
BIG-bench. On a random train-test split, an MLP-based predictor achieves RMSE
below 5%, demonstrating the presence of learnable patterns within the
experiment records. Further, we formulate the problem of searching for
"small-bench," an informative subset of BIG-bench tasks from which the
performance of the full set can be maximally recovered, and find a subset as
informative for evaluating new model families as BIG-bench Hard, while being 3x
smaller.
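
The random train-test evaluation described above can be sketched as follows. The abstract reports an MLP-based predictor; this sketch deliberately swaps in a much simpler per-task mean baseline to keep the example dependency-free, so it shows only the evaluation protocol (random split, then RMSE on held-out records). The record fields and function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle experiment records and hold out a random fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_task_mean_baseline(train):
    """Return a predictor that outputs the mean training score of each task.

    This is a stand-in for the MLP predictor, not the paper's model.
    Tasks unseen in training fall back to the global mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in train:
        sums[record["task"]] += record["score"]
        counts[record["task"]] += 1
    global_mean = sum(r["score"] for r in train) / len(train)
    task_means = {task: sums[task] / counts[task] for task in sums}
    return lambda record: task_means.get(record["task"], global_mean)


def rmse(predictor, test):
    """Root mean squared error of the predictor on held-out records."""
    squared = [(predictor(r) - r["score"]) ** 2 for r in test]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical records; real ones would also carry model family,
    # parameter count, and number of in-context examples.
    records = [
        {"task": f"task_{i % 3}", "score": 0.1 * (i % 3) + 0.01 * i}
        for i in range(20)
    ]
    train, test = train_test_split(records)
    predictor = fit_task_mean_baseline(train)
    print(f"held-out RMSE: {rmse(predictor, test):.4f}")
```

A learned predictor that also uses model family, parameter count, and shot count as features would replace `fit_task_mean_baseline` while leaving the split and the RMSE computation unchanged.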

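This work studies how predictable large language model capabilities are, using BIG-bench experiment records as a case study. On a random train-test split, an MLP-based predictor achieves an RMSE below 5%. The work also poses the problem of finding an informative subset of tasks for evaluating new model families and finds one that is as informative as BIG-bench Hard while being 3x smaller.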

How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that
focuses on tasks believed to be beyond the capabilities of current language
models. Language models have already made good progress on this benchmark, with
the best model in the BIG-Bench paper outperforming average reported
human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But
on what tasks do language models fall short of average human-rater performance,
and are those tasks actually unsolvable by current language models?
In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we
call BIG-Bench Hard (BBH). These are the tasks for which prior language model
evaluations did not outperform the average human-rater. We find that applying
chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the
average human-rater performance on 10 of the 23 tasks, and Codex
(code-davinci-002) to surpass the average human-rater performance on 17 of the
23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot
prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al.,
2022), substantially underestimates the best performance and capabilities of
language models, which is better captured via CoT prompting. As further
analysis, we explore the interaction between CoT and model scale on BBH,
finding that CoT enables emergent task performance on several BBH tasks with
otherwise flat scaling curves.
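
The difference between answer-only few-shot prompting and CoT prompting can be sketched as a change in how exemplars are formatted. The template below (the "Q:"/"A:" labels and the "So the answer is" phrasing), the function name, and the exemplar are illustrative assumptions, not the exact prompts used in the evaluations.

```python
def build_prompt(exemplars, question, use_cot):
    """Assemble a few-shot prompt, with or without reasoning steps.

    Each exemplar is a dict with 'question', 'rationale', and 'answer'.
    With use_cot=True the rationale precedes the answer in every exemplar;
    with use_cot=False only the final answer is shown.
    """
    blocks = []
    for ex in exemplars:
        if use_cot:
            answer = f"{ex['rationale']} So the answer is {ex['answer']}."
        else:
            answer = ex["answer"]
        blocks.append(f"Q: {ex['question']}\nA: {answer}")
    # The test question is left open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical exemplar, written only for this illustration.
    exemplars = [
        {
            "question": "I have 3 apples and buy 2 more. How many apples do I have?",
            "rationale": "I start with 3 apples. Buying 2 more gives 3 + 2 = 5.",
            "answer": "5",
        }
    ]
    new_question = "I have 4 pens and lose 1. How many pens do I have?"
    print(build_prompt(exemplars, new_question, use_cot=True))
```

Calling the same function with `use_cot=False` yields the answer-only prompt, which makes it straightforward to compare the two settings on identical exemplars and questions.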

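This work focuses on BIG-Bench Hard (BBH), a suite of 23 BIG-Bench tasks on which prior language model evaluations did not outperform the average human rater. Applying chain-of-thought (CoT) prompting improves performance, since many of these tasks require multi-step reasoning, and the benefit of CoT interacts with model scale.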

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative
impact, these new capabilities are as yet poorly characterized. In order to
inform future research, prepare for disruptive new model capabilities, and
ameliorate socially harmful effects, it is vital that we understand the present
and near-future capabilities and limitations of language models. To address
this challenge, we introduce the Beyond the Imitation Game benchmark
(BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442
authors across 132 institutions. Task topics are diverse, drawing problems from
linguistics, childhood development, math, common-sense reasoning, biology,
physics, social bias, software development, and beyond. BIG-bench focuses on
tasks that are believed to be beyond the capabilities of current language
models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense
transformer architectures, and Switch-style sparse transformers on BIG-bench,
across model sizes spanning millions to hundreds of billions of parameters. In
addition, a team of human expert raters performed all tasks in order to provide
a strong baseline. Findings include: model performance and calibration both
improve with scale, but are poor in absolute terms (and when compared with
rater performance); performance is remarkably similar across model classes,
though with benefits from sparsity; tasks that improve gradually and
predictably commonly involve a large knowledge or memorization component,
whereas tasks that exhibit "breakthrough" behavior at a critical scale often
involve multiple steps or components, or brittle metrics; social bias typically
increases with scale in settings with ambiguous context, but this can be
improved with prompting.

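Introducing the Beyond the Imitation Game benchmark (BIG-bench), we evaluate language models of many sizes on 204 tasks spanning diverse domains. Performance and calibration both improve with scale but remain poor compared with human expert raters; social bias typically increases with scale in settings with ambiguous context, though prompting can reduce it.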