This study presents NewsBench, a novel benchmark framework developed to
evaluate the capability of Large Language Models (LLMs) in Chinese Journalistic
Writing Proficiency (JWP) and their Safety Adherence (SA), addressing the gap
between journalistic ethics and the risks associated with AI utilization.
Comprising 1,267 tasks across 5 editorial applications, 7 aspects (including
safety and journalistic writing with 4 detailed facets), and spanning 24 news
topic domains, NewsBench employs two GPT-4-based automatic evaluation
protocols validated by human assessment. Our comprehensive analysis of 11 LLMs
highlighted GPT-4 and ERNIE Bot as top performers, yet revealed relatively
weak adherence to journalistic ethics during creative writing tasks. These
findings underscore the need for enhanced ethical guidance in AI-generated
journalistic content, marking a step forward in aligning AI capabilities with
journalistic standards and safety considerations.

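This study presents NewsBench, a novel benchmark framework for evaluating large language models (LLMs) on Chinese Journalistic Writing Proficiency (JWP) and Safety Adherence (SA), addressing the gap between journalistic ethics and the risks of AI use. A comprehensive analysis of 11 LLMs found that GPT-4 and ERNIE Bot performed best, but revealed relatively weak adherence to journalistic ethics in creative writing tasks. These findings underscore the need for stronger ethical guidance in AI-generated news content and mark a step toward aligning AI capabilities with journalistic standards and safety considerations.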

NewsBench: Systematic Evaluation of LLMs for Writing Proficiency and Safety Adherence in Chinese Journalistic Editorial Applications

High-risk domains pose unique challenges that require language models to
provide accurate and safe responses. Despite the great success of large
language models (LLMs), such as ChatGPT and its variants, their performance in
high-risk domains remains unclear. Our study delves into an in-depth analysis
of the performance of instruction-tuned LLMs, focusing on factual accuracy and
safety adherence. To comprehensively assess the capabilities of LLMs, we
conduct experiments on six NLP datasets including question answering and
summarization tasks within two high-risk domains: legal and medical. Further
qualitative analysis highlights the limitations of current
LLMs when evaluated in high-risk domains. This underscores the essential
nature of not only improving LLM capabilities but also prioritizing the
refinement of domain-specific metrics, and embracing a more human-centric
approach to enhance safety and factual reliability. Our findings direct the
field's attention to the problem of properly evaluating LLMs in high-risk domains,
aiming to steer the adaptability of LLMs in fulfilling societal obligations and
aligning with forthcoming regulations, such as the EU AI Act.

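Evaluating language model performance in high-risk domains is an important problem. This study analyzes instruction-tuned language models in depth, focusing on factual accuracy and safety. Experiments on six NLP datasets across two high-risk domains, legal and medical, reveal the limitations of current language models and underscore the importance of improving model capabilities, refining domain-specific metrics, and adopting a more human-centric approach to enhance safety and factual reliability. The findings help steer LLMs toward adapting to high-risk domains, fulfilling societal obligations, and complying with the forthcoming EU AI Act.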