While various vertical-domain large language models (LLMs) have been developed, automatically evaluating their performance across different domains remains a significant challenge. Current benchmark-based evaluation methods exhibit rigid, aimless interactions and rely on pre-collected static datasets that are costly to build, inflexible across domains, and misaligned with practical user needs. To address this issue, we revisit the components of evaluation and introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process, enabling deeper exploration and supporting both quantitative metrics and qualitative insights. These concepts capture the nuanced behaviors of LLMs through richer, multi-turn interactions. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval-augmented generation and reinforcement learning. Experiments on tasks ranging from constructing vertical-domain evaluations to activating existing benchmarks demonstrate the effectiveness of TestAgent across various scenarios. We believe this work offers an interesting perspective on automatic evaluation for LLMs.

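This work addresses the challenge of automatically evaluating the performance of large language models (LLMs) across domains and identifies the limitations of existing evaluation methods. By introducing the concepts of Benchmark+ and Assessment+, the paper proposes TestAgent, an agent-based dynamic evaluation framework that uses retrieval-augmented generation and reinforcement learning to support a more flexible and in-depth interaction process. Experimental results show that TestAgent performs well across a variety of scenarios, advancing research on automatic evaluation of LLMs.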

Revisiting Benchmark and Assessment: An Agent-Based Exploratory Dynamic Evaluation Framework for Large Language Models