Automatic text simplification (TS) aims to automate the process of rewriting
text to make it easier for people to read. A prerequisite for TS to be useful
is that it should convey information that is consistent with the meaning of the
original text. However, current TS evaluation protocols assess system outputs
for simplicity and meaning preservation without regard for the document context
in which output sentences occur or for how people understand them. In this
work, we introduce a human evaluation framework to assess whether simplified
texts preserve meaning using reading comprehension questions. With this
framework, we conduct a thorough human evaluation of texts simplified by humans
and by nine automatic systems. Supervised systems that leverage pre-training
knowledge achieve the highest scores on the reading comprehension (RC) tasks
among the automatic controllable TS systems. However, even for the
best-performing supervised system, at least 14% of the questions are marked as
"unanswerable" based on the simplified content. We further investigate how
existing TS evaluation metrics and automatic question-answering systems
approximate the human judgments we obtained.
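
As an illustration of how an "unanswerable" rate like the one reported above could be computed from reading comprehension judgments, here is a minimal sketch. The record layout, the system names, and the toy data are assumptions made for this example and are not taken from the study.

```python
from collections import defaultdict

# Label used when an annotator cannot answer a question from the simplified text.
# The label string itself is an assumption for this sketch.
UNANSWERABLE = "unanswerable"


def unanswerable_rate(records):
    """Return {system: fraction of judgments marked unanswerable}.

    Each record is a hypothetical (system, question_id, answer) tuple.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for system, _question_id, answer in records:
        totals[system] += 1
        if answer == UNANSWERABLE:
            flagged[system] += 1
    return {system: flagged[system] / totals[system] for system in totals}


# Toy data, invented for illustration only.
records = [
    ("system_a", "q1", "B"),
    ("system_a", "q2", UNANSWERABLE),
    ("system_a", "q3", "A"),
    ("system_a", "q4", "C"),
    ("system_b", "q1", UNANSWERABLE),
    ("system_b", "q2", UNANSWERABLE),
    ("system_b", "q3", "A"),
    ("system_b", "q4", "D"),
]
rates = unanswerable_rate(records)
```

A per-system rate of this kind can then be compared against automatic metrics or question-answering systems to see how closely they track the human judgments.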

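Automatic text simplification (TS) aims to automate the process of rewriting text so that it is easier for people to read. This work introduces a human evaluation framework that uses reading comprehension questions to assess whether simplified texts preserve meaning, and applies it in a thorough human evaluation of texts simplified by humans and by nine automatic systems.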

Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension

Large language models (LLMs) have shown impressive capabilities across
various natural language tasks. However, evaluating their alignment with human
preferences remains a challenge. To this end, we propose a comprehensive human
evaluation framework to assess LLMs' proficiency in following instructions on
diverse real-world tasks. We construct a hierarchical task tree with 7 major
areas, over 200 categories, and over 800 tasks, covering diverse capabilities
such as question answering, reasoning, multi-turn dialogue, and text
generation, to evaluate LLMs in a comprehensive and in-depth manner.
We also design detailed evaluation standards and processes to facilitate
consistent, unbiased judgments from human evaluators. A test set of over 3,000
instances is released, spanning different difficulty levels and knowledge
domains. Our work provides a standardized methodology to evaluate human
alignment in LLMs for both English and Chinese. We also analyze the feasibility
of automating parts of evaluation with a strong LLM (GPT-4). Our framework
supports a thorough assessment of LLMs as they are integrated into real-world
applications. We have made publicly available the task tree, the TencentLLMEval
dataset, and the evaluation methodology, which have been shown to be effective
in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to
facilitate the benchmarking of advances in the development of safe and
human-aligned LLMs.
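
The three-level hierarchy described above (areas, categories, tasks) can be pictured as a small tree structure. The sketch below is only an illustration: the `TaskNode` class and every node name in it are hypothetical and do not reproduce the released task tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """One node in a hypothetical area -> category -> task hierarchy."""
    name: str
    children: List["TaskNode"] = field(default_factory=list)

    def add(self, child: "TaskNode") -> "TaskNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def count_at_depth(self, depth: int) -> int:
        """Count the nodes located exactly `depth` levels below this node."""
        if depth == 0:
            return 1
        return sum(child.count_at_depth(depth - 1) for child in self.children)


# Invented example entries; the real tree has 7 areas, 200+ categories, 800+ tasks.
root = TaskNode("root")
generation = root.add(TaskNode("text generation"))
summarization = generation.add(TaskNode("summarization"))
summarization.add(TaskNode("summarize a news article"))
summarization.add(TaskNode("summarize a meeting transcript"))
reasoning = root.add(TaskNode("reasoning"))
arithmetic = reasoning.add(TaskNode("arithmetic"))
arithmetic.add(TaskNode("solve a word problem"))

n_areas = root.count_at_depth(1)
n_categories = root.count_at_depth(2)
n_tasks = root.count_at_depth(3)
```

Counting nodes per level in this way gives a quick check that a test set covers every area and category before human evaluators are assigned to it.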

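By building a comprehensive human evaluation framework, we propose a method for assessing how well large language models follow instructions across diverse real-world tasks. We also design detailed evaluation standards and processes, release a test set spanning different difficulty levels and knowledge domains, and analyze the feasibility of automating evaluation. Our work provides a standardized methodology for evaluating the human alignment of English and Chinese LLMs, and aims to facilitate benchmarking of progress toward safe, human-aligned LLMs.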

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

As AI technology is increasingly applied to high-impact, high-risk domains,
there have been a number of new methods aimed at making AI models more
human-interpretable. Despite the recent growth of interpretability work, there is a
lack of systematic evaluation of proposed techniques. In this work, we
introduce HIVE (Human Interpretability of Visual Explanations), a novel human
evaluation framework that assesses the utility of explanations to human users
in AI-assisted decision-making scenarios, and enables falsifiable hypothesis
testing, cross-method comparison, and human-centered evaluation of visual
interpretability methods. To the best of our knowledge, this is the first work
of its kind. Using HIVE, we conduct IRB-approved human studies with nearly 1000
participants and evaluate four methods that represent the diversity of computer
vision interpretability work: GradCAM, BagNet, ProtoPNet, and ProtoTree. Our
results suggest that explanations engender human trust, even for incorrect
predictions, yet are not distinct enough for users to distinguish between
correct and incorrect predictions. We open-source HIVE to enable future studies
and encourage more human-centered approaches to interpretability research.

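This work proposes HIVE, a human evaluation framework for the human interpretability of visual explanations. An evaluation of four computer vision interpretability methods suggests that explanations engender human trust, yet users struggle to distinguish correct predictions from incorrect ones. The framework is open-sourced to enable future studies and to encourage more human-centered interpretability research.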