Assessing the performance of interpreting services is a complex task, given
the nuanced nature of spoken language translation, the strategies that
interpreters apply, and the diverse expectations of users. The complexity of
this task becomes even more pronounced when automated evaluation methods are
applied. This is particularly true because interpreted texts exhibit less
linearity between the source and target languages due to the strategies
employed by the interpreter.
This study aims to assess the reliability of automatic metrics in evaluating
simultaneous interpretations by analyzing their correlation with human
evaluations. We focus on a particular feature of interpretation quality, namely
translation accuracy or faithfulness. As a benchmark we use human assessments
performed by language experts, and evaluate how well sentence embeddings and
Large Language Models correlate with them. We quantify semantic similarity
between the source and translated texts without relying on a reference
translation. The results suggest GPT models, particularly GPT-3.5 with direct
prompting, demonstrate the strongest correlation with human judgment in terms
of semantic similarity between source and target texts, even when evaluating
short textual segments. Additionally, the study reveals that the size of the
context window has a notable impact on this correlation.
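
As a rough illustration of the reference-free setup described above, the sketch below scores each source/target pair by the cosine similarity of their embeddings and then correlates those scores with human ratings. It is a minimal sketch under stated assumptions, not the study's implementation: the toy vectors and ratings are invented placeholders, the function names are hypothetical, and Pearson's r is assumed here because the abstract does not say which correlation coefficient was used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder embeddings (assumption): in practice these would come from a
# multilingual sentence-embedding model applied to source and target segments.
source_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
target_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Placeholder human accuracy ratings (assumption), one per segment.
human_scores = [5.0, 3.0, 1.0]

# Reference-free metric: similarity of each source segment to its translation.
metric_scores = [cosine_similarity(s, t) for s, t in zip(source_vecs, target_vecs)]

# Agreement between the automatic metric and the human ratings.
r = pearson(metric_scores, human_scores)
```

A rank-based coefficient such as Spearman's or Kendall's could be substituted for `pearson` if the human ratings are ordinal.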

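Assessing the performance of interpreting services is a complex task, especially when automated evaluation methods are applied. This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretation by analyzing their correlation with human evaluations. The results indicate that GPT models, particularly GPT-3.5, show the strongest correlation with human judgment of semantic similarity, even when evaluating short text segments.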

Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation

Opinion summarization sets itself apart from other types of summarization
tasks due to its distinctive focus on aspects and sentiments. Although certain
automated evaluation methods like ROUGE have gained popularity, we have found
them to be unreliable measures for assessing the quality of opinion summaries.
In this paper, we present OpinSummEval, a dataset comprising human judgments
and outputs from 14 opinion summarization models. We further explore the
correlation between 24 automatic metrics and human ratings across four
dimensions. Our findings indicate that metrics based on neural networks
generally outperform non-neural ones. However, even metrics built on powerful
backbones, such as BART and GPT-3/3.5, do not consistently correlate well
across all dimensions, highlighting the need for advancements in automated
evaluation methods for opinion summarization. The code and data are publicly
available at this https URL
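
To make the per-dimension comparison concrete, the sketch below computes a rank correlation between one automatic metric and human ratings separately for each evaluation dimension. It is only a sketch under stated assumptions: Kendall's tau (the tau-a variant, without tie correction) is assumed because the abstract does not name the coefficient, the dimension labels `dim_a` and `dim_b` are hypothetical stand-ins, and all scores are invented placeholders rather than data from OpinSummEval.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder scores from one automatic metric, one per summary (assumption).
metric_scores = [0.2, 0.5, 0.7, 0.9]

# Placeholder human ratings per dimension (assumption); names are hypothetical.
human_ratings = {
    "dim_a": [1, 2, 3, 4],  # ranks the summaries the same way as the metric
    "dim_b": [4, 1, 3, 2],  # largely disagrees with the metric's ranking
}

# One correlation value per dimension, so that a metric which tracks human
# judgment on one dimension but not another is easy to spot.
tau_by_dimension = {
    name: kendall_tau(metric_scores, ratings)
    for name, ratings in human_ratings.items()
}
```

Reporting one value per dimension, rather than a single pooled score, is what lets inconsistencies across dimensions show up.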

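Opinion summarization differs from other summarization tasks in its distinctive focus on aspects and sentiments. This paper introduces OpinSummEval, a dataset comprising human judgments and the outputs of 14 opinion summarization models. We further explore the correlation between 24 automatic metrics and human ratings across four dimensions. The results indicate that metrics based on neural networks generally outperform non-neural ones. However, even metrics built on powerful backbones, such as BART and GPT-3/3.5, do not correlate consistently with human ratings across all dimensions, highlighting the need for advances in automated evaluation methods for opinion summarization. The code and data are publicly available at this URL.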