Large Language Models (LLMs) have achieved remarkable success in many formal
language-oriented tasks, such as structural data-to-text and semantic parsing.
However, current benchmarks mostly follow the data distribution of the
pre-training data of LLMs. A natural question therefore arises: do LLMs
really understand the structured semantics of formal languages? In this paper,
we investigate this problem on a special case, converse binary relations. We
introduce a new benchmark ConvRe focusing on converse relations, which contains
17 relations and 1240 triples extracted from popular knowledge graph completion
datasets. ConvRe features two tasks, Re2Text and Text2Re, which are
formulated as multiple-choice question answering to evaluate LLMs' ability to
determine the matching between relations and associated text. For the
evaluation protocol, apart from different prompting methods, we further
introduce variants of the test text and the few-shot example text. We conduct
experiments on three popular LLM families and observe various scaling
trends. The results suggest that LLMs often resort to shortcut learning and
still face challenges on our proposed benchmark.
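
To make the notion of a converse relation concrete, the following is a minimal sketch, not the benchmark's actual data format or prompt template. The `Triple` class, the `converse` and `multiple_choice_prompt` helpers, the relation names "has part" and "part of", and the wording of the question are all hypothetical and chosen only for illustration. It shows that a triple (x, R, y) corresponds to (y, R', x) under the converse relation R', and how a two-option multiple-choice item matching a triple to text might be assembled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def converse(triple: Triple, converse_name: str) -> Triple:
    # (x, R, y) holds exactly when (y, R_converse, x) holds, so swap the arguments.
    return Triple(triple.tail, converse_name, triple.head)


def multiple_choice_prompt(question: str, options: list[str]) -> str:
    # Label the options (A), (B), ... in the order given (hypothetical format).
    lines = [question]
    for i, option in enumerate(options):
        lines.append(f"({chr(ord('A') + i)}) {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical example triple; not taken from the benchmark.
original = Triple("car", "has part", "wheel")
flipped = converse(original, "part of")

prompt = multiple_choice_prompt(
    f"Which sentence expresses the triple "
    f"({flipped.head}, {flipped.relation}, {flipped.tail})?",
    ["A wheel is part of a car.", "A car is part of a wheel."],
)
print(prompt)
```

Applying `converse` twice with the original relation name returns the original triple, which is the defining property of a converse relation.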

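Large language models have achieved remarkable success on formal-language tasks, but current benchmarks mostly follow the distribution of LLMs' pre-training data. This paper examines LLMs' understanding of structured semantics in one special case and proposes the ConvRe benchmark, which uses multiple-choice question answering to evaluate LLMs' ability to match relations with their associated text. The experimental results show that LLMs still face challenges on this benchmark.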

A Study of LLMs' Inefficacy in Understanding Converse Relations

An Investigation of LLMs' Inefficacy in Understanding Converse Relations

As the scale of machine learning models increases, trends such as scaling
laws anticipate consistent downstream improvements in predictive accuracy.
However, these trends take the perspective of a single model-provider in
isolation, while in reality providers often compete with each other for users.
In this work, we demonstrate that competition can fundamentally alter the
behavior of these scaling trends, even causing overall predictive accuracy
across users to be non-monotonic or decreasing with scale. We define a model of
competition for classification tasks, and use data representations as a lens
for studying the impact of increases in scale. We find many settings where
improving data representation quality (as measured by Bayes risk) decreases the
overall predictive accuracy across users (i.e., social welfare) for a
marketplace of competing model-providers. Our examples range from closed-form
formulas in simple settings to simulations with pretrained representations on
CIFAR-10. At a conceptual level, our work suggests that favorable scaling
trends for individual model-providers need not translate to downstream
improvements in social welfare in marketplaces with multiple model-providers.
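
As a small illustration of Bayes risk as a measure of representation quality, the sketch below computes the Bayes risk under 0-1 loss for a toy discrete joint distribution over representation values and labels. It does not reproduce the paper's competition model or its social welfare computation. The `bayes_risk` function and the two toy distributions are hypothetical and chosen only to show that a more informative representation has a lower Bayes risk.

```python
def bayes_risk(joint):
    # joint[z][y] = P(representation = z, label = y); all entries sum to 1.
    # Under 0-1 loss the Bayes-optimal predictor picks the most likely label
    # for each representation value, so the risk is 1 minus the mass it gets right.
    return 1.0 - sum(max(row) for row in joint)


# Toy distributions (assumed for illustration only).
# Fine representation: two values, each fairly informative about the label.
fine = [[0.4, 0.1],
        [0.1, 0.4]]
# Coarse representation: both values merged, so the label looks like a coin flip.
coarse = [[0.5, 0.5]]

print(round(bayes_risk(fine), 3), round(bayes_risk(coarse), 3))
```

In this toy case the fine representation has the lower Bayes risk. The abstract's claim is that such an improvement for each provider need not raise overall accuracy across users once providers compete.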

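This study analyzes how the scale of machine learning models affects predictive accuracy in a marketplace of competing model-providers. It finds that in some settings, improving data representation quality, even though it reduces Bayes risk, lowers the overall predictive accuracy across users.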