Modern Large Language Models (LLMs) have showcased remarkable prowess in
various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a
paradoxical performance discrepancy is observed: these models appear to
underperform on seemingly elementary tasks such as relation extraction and event
extraction. This is due to two issues in conventional evaluation: (1) the
imprecision of existing evaluation metrics, which struggle to gauge semantic
consistency between model outputs and ground truth, and (2) the inherent
incompleteness of evaluation benchmarks, primarily due to restrictive human
annotation schemas, which results in underestimated LLM performance. Inspired by
the principles in subjective question correction, we propose a new evaluation
method, SQC-Score. This method uses LLMs fine-tuned on subjective question
correction data to refine the matching between model outputs
and golden labels. Additionally, by incorporating a Natural Language Inference
(NLI) model, SQC-Score enriches golden labels, addressing benchmark
incompleteness by acknowledging correct yet previously omitted answers. Results
on three information extraction tasks show that human annotators prefer
SQC-Score over the baseline metrics. Using SQC-Score, we conduct a
comprehensive evaluation of state-of-the-art LLMs and provide insights for
future research on information extraction. The dataset and associated code can be
accessed at this https URL

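Drawing on subjective question correction, this work evaluates the performance of modern large language models on information extraction tasks. It proposes the SQC-Score evaluation method, which measures the semantic consistency between model outputs and ground-truth labels and, by incorporating a natural language inference model, enriches the golden labels to address the incompleteness of evaluation benchmarks. Human annotators prefer SQC-Score over the baseline metrics, and SQC-Score is used to conduct a comprehensive evaluation of state-of-the-art large language models, providing insights for future research on information extraction.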