Large Language Models (LLMs) have achieved remarkable success in many formal
language oriented tasks, such as structural data-to-text and semantic parsing.
However current benchmarks mostly follow the data distribution of the
pre-training data of LLMs. Therefore, a natural question rises that do LLMs
really understand the structured semantics of formal languages. In this paper,
we investigate this problem on a special case, converse binary relation. We
introduce a new benchmark ConvRe focusing on converse relations, which contains
17 relations and 1240 triples extracted from popular knowledge graph completion
datasets. Our ConvRE features two tasks, Re2Text and Text2Re, which are
formulated as multi-choice question answering to evaluate LLMs' ability to
determine the matching between relations and associated text. For the
evaluation protocol, apart from different prompting methods, we further
introduce variants to the test text and few-shot example text. We conduct
experiments on three popular LLM families and have observed various scaling
trends. The results suggest that LLMs often resort to shortcut learning and
still face challenges on our proposed benchmark.

大型语言模型在形式化语言任务中取得了显著的成功，但目前的基准主要遵循 LLM 的预训练数据分布。本文探讨了 LLM 在一种特殊情况下的结构语义理解能力问题，提出了 ConvRe 基准，通过多项选择问答任务评估 LLM 确定关系和相关文本匹配的能力。实验结果表明，LLM 在该基准上仍存在挑战。