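TL;DR: This work introduces PandaLM, a large language model used to evaluate other large language models more fairly. PandaLM does not rely on API-based evaluation and can compare the performance of many GPT-family models relatively simply, providing the automatic, robust, and reliable evaluation benchmark needed to select optimal hyperparameters.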
Abstract
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable →