Large language models (LLMs) are essential tools for collaborating with users on
different tasks. Evaluating how well they serve users' needs in
real-world scenarios is important. While many benchmarks have been created,
they mainly focus on specific predefined model abilities. Few