This paper presents a benchmark self-evolving framework to dynamically
evaluate rapidly advancing Large Language Models (LLMs), aiming for a more
accurate assessment of their capabilities and limitations. We utilize a
multi-agent system to manipulate the context or question of original instances,
reframing new evolving instances with high confidence that dynamically extend
existing benchmarks. Towards a more scalable, robust and fine-grained
evaluation, we implement six reframing operations to construct evolving
instances that test LLMs against diverse queries and data noise and probe
their problem-solving sub-abilities. With this framework, we extend benchmark
datasets of four tasks. Experimental results show a general performance decline
in most LLMs against their original results. This decline under our scalable
and robust evaluations, alongside our fine-grained evaluation, more accurately
reflects models' capabilities. In addition, our framework widens performance
discrepancies both between different models and within the same model across
various tasks, facilitating more informed model selection for specific tasks
(Code and data are available at
this https URL).

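This study proposes a benchmark self-evolving framework that dynamically evaluates the capabilities and limitations of rapidly advancing large language models (LLMs). It implements reframing operations based on a multi-agent system to construct evolving instances, evaluates LLMs in a more scalable, robust and fine-grained manner, and finds that their performance generally declines across multiple tasks.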