Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language
understanding consist of parallel versions of English evaluation sets in
multiple target languages created with the help of professional translators.
When creating such parallel data, it is critical to ensure high-quality
translations for all target languages for an accurate characterization of
cross-lingual transfer. In this work, we find that translation inconsistencies
do exist and, interestingly, that they disproportionately impact low-resource
languages in XNLI. To identify such inconsistencies, we propose measuring the
gap in performance between zero-shot evaluations on the human-translated and
machine-translated target text across multiple target languages; relatively
large gaps are indicative of translation errors. We also corroborate that
translation errors exist for two target languages, namely Hindi and Urdu, by
doing a manual reannotation of human-translated test instances in these two
languages and finding poor agreement with the original English labels these
instances were supposed to inherit.
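
The two diagnostics described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the sign convention (machine-translated accuracy minus human-translated accuracy), and the flagging threshold are all assumptions, and the language codes and labels in the demo are made-up toy data.

```python
from typing import Dict, List, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


def translation_gaps(
    labels: Sequence[str],
    human_preds: Dict[str, Sequence[str]],
    machine_preds: Dict[str, Sequence[str]],
) -> Dict[str, float]:
    """Per-language zero-shot accuracy gap.

    Assumed convention: accuracy on machine-translated text minus accuracy
    on human-translated text, scored against the shared English labels.
    """
    return {
        lang: accuracy(machine_preds[lang], labels)
        - accuracy(human_preds[lang], labels)
        for lang in human_preds
    }


def flag_languages(gaps: Dict[str, float], threshold: float) -> List[str]:
    """Languages whose absolute gap exceeds a (hypothetical) threshold."""
    return sorted(lang for lang, gap in gaps.items() if abs(gap) > threshold)


def label_agreement(reannotated: Sequence[str], original: Sequence[str]) -> float:
    """Share of re-annotated instances whose label matches the inherited one."""
    return accuracy(reannotated, original)


if __name__ == "__main__":
    # Toy data only; "lang_a" and "lang_b" are placeholder language codes.
    gold = ["entailment", "neutral", "contradiction", "neutral"]
    human = {
        "lang_a": ["entailment", "neutral", "neutral", "entailment"],
        "lang_b": ["entailment", "neutral", "contradiction", "entailment"],
    }
    machine = {
        "lang_a": ["entailment", "neutral", "contradiction", "entailment"],
        "lang_b": ["entailment", "neutral", "contradiction", "entailment"],
    }
    gaps = translation_gaps(gold, human, machine)
    print(gaps)
    print(flag_languages(gaps, threshold=0.1))
```

A relatively large gap for one language compared with the others would point to that language's human translations as a candidate for manual review; `label_agreement` then quantifies how often re-annotated labels match the inherited English ones.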

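Popular benchmarks for evaluating cross-lingual language understanding, such as XNLI, rely on parallel versions of English evaluation sets created by professional translators for multiple target languages, so high-quality translation in every target language is essential for accurately characterizing cross-lingual transfer. This study finds that translation inconsistencies do exist and that they disproportionately affect low-resource languages in XNLI. Such inconsistencies are identified by measuring the gap between zero-shot performance on human-translated and machine-translated target text across multiple target languages; a relatively large gap indicates translation errors. In addition, a manual reannotation of human-translated test instances in Hindi and Urdu confirms that translation errors exist, showing poor agreement between these instances and their original English labels.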

Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning

Recent zero-shot evaluations have highlighted important limitations in the
abilities of language models (LMs) to perform meaning extraction. However, it
is now well known that LMs can demonstrate radical improvements in the presence
of experimental contexts such as in-context examples and instructions. How well
does this translate to previously studied meaning-sensitive tasks? We present a
case-study on the extent to which experimental contexts can improve LMs'
robustness in performing property inheritance -- predicting semantic properties
of novel concepts, a task that they have been previously shown to fail on. Upon
carefully controlling the nature of the in-context examples and the
instructions, our work reveals that they can indeed lead to non-trivial
property inheritance behavior in LMs. However, this ability is inconsistent:
with a minimal reformulation of the task, some LMs were found to pick up on
shallow, non-semantic heuristics from their inputs, suggesting that the
computational principles of semantic property inference are yet to be mastered
by LMs.

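Language models have previously been shown to perform poorly on property inheritance tasks. Our study shows that adding in-context examples and instructions as experimental context can substantially improve their robustness, but this ability is inconsistent, suggesting that language models have yet to master the computational principles of semantic property inference.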

Experimental Contexts Can Facilitate Robust Semantic Property Inference in Language Models, but Inconsistently

Large language models (LLMs) have been shown to perform well at a variety of
syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed
in many forms including conversational agents that interact with humans, we
lack a grounded benchmark to measure how well LLMs understand \textit{social}
language. Here, we introduce a new theory-driven benchmark, SocKET, that
contains 58 NLP tasks testing social knowledge which we group into five
categories: humor & sarcasm, offensiveness, sentiment & emotion, social factors,
and trustworthiness. In tests on the benchmark, we demonstrate that current
models attain only moderate performance but reveal significant potential for task
transfer among different types and categories of tasks, which were predicted
from theory. Through zero-shot evaluations, we show that pretrained models
already possess some innate but limited capabilities of social language
understanding, and training on one category of tasks can improve zero-shot
testing on others. Our benchmark provides a systematic way to analyze model
performance on an important dimension of language and points to clear room for
improvement to build more socially-aware LLMs. The associated resources are
released at this https URL

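This work introduces SocKET, a new theory-driven benchmark for testing how well large language models understand social language. Results show that current models attain only moderate performance, but that there is potential for task transfer across different types and categories of tasks. Zero-shot evaluations further reveal that pretrained models already possess some innate capability for social language understanding. The benchmark provides a systematic way to analyze model performance on an important dimension of language and offers guidance for building more socially aware LLMs.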