Generative chat models, such as ChatGPT and GPT-4, have revolutionized natural language generation (NLG) by incorporating instructions and human feedback to achieve significant performance improvements. However, the lack of standardized evaluation benchmarks for chat models, particularly for Chinese and domain-specific models, hinders their assessment and progress. To address this gap, we introduce the Chinese Generative Chat Evaluation (CGCE) benchmark, focusing on general and financial domains. The CGCE benchmark encompasses diverse tasks, including 200 questions in the general domain and 150 specific professional questions in the financial domain. Manual scoring evaluates factors such as accuracy, coherence, expression clarity, and completeness. The CGCE benchmark provides researchers with a standardized framework to assess and compare Chinese generative chat models, fostering advancements in NLG research.

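This work introduces the Chinese Generative Chat Evaluation (CGCE) benchmark, designed to evaluate and compare generative models. The benchmark consists of 200 general-domain questions and 150 professional financial-domain questions, and assesses factors such as accuracy, coherence, expression clarity, and completeness. It provides researchers with a standardized framework and promotes the development of natural language generation research.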

CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains