tinyBenchmarks: Evaluating LLMs with Fewer Examples

Feb, 2024

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu...

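TL;DR: By studying how LLMs perform across several key benchmarks, we explore strategies for reducing the number of evaluations needed to assess LLM performance. We also release evaluation tools and tiny benchmarks, and show that they are sufficient to reproduce the original evaluation results reliably and efficiently.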

Abstract

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language model