Large language models (LLMs) have demonstrated impressive capabilities, but
still suffer from inconsistency issues (e.g., LLMs can react differently to
disturbances such as rephrasing or inconsequential changes in order). In addition to
these inconsistencies, we also observe that LLMs, while capable of solving hard
problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy
inconsistency, we develop the ConsisEval benchmark, where each entry comprises
a pair of questions with a strict order of difficulty. Furthermore, we
introduce a consistency score to quantitatively measure this inconsistency
and a relative consistency score to analyze the potential for improvement in
consistency. Based on comprehensive experiments across a variety of existing
models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but
is still inconsistent on specific questions due to distraction by redundant
information, misinterpretation of questions, etc.; (2) models with
stronger capabilities typically exhibit higher consistency, but exceptions also
exist; (3) hard data enhances consistency for both fine-tuning and in-context
learning. Our data and code will be publicly available on GitHub.
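
The abstract does not give a formula for the consistency score. One plausible reading, assumed here purely for illustration, is the fraction of question pairs whose easy question is answered correctly among the pairs whose hard question is answered correctly. The function name and the `(hard_correct, easy_correct)` input format below are hypothetical and are not taken from the ConsisEval code.

```python
from typing import Iterable, Tuple


def consistency_score(results: Iterable[Tuple[bool, bool]]) -> float:
    """Estimate P(easy correct | hard correct) from per-pair outcomes.

    Each item is (hard_correct, easy_correct) for one hard/easy question pair.
    This conditional-frequency definition is an assumption for illustration.
    """
    # Keep the easy-question outcome only for pairs whose hard question was solved.
    easy_given_hard = [easy for hard, easy in results if hard]
    if not easy_given_hard:
        raise ValueError("no pair has a correctly answered hard question")
    return sum(easy_given_hard) / len(easy_given_hard)


# Toy outcomes for five pairs: three hard questions solved, two of their easy
# counterparts solved as well.
pairs = [(True, True), (True, False), (False, True), (True, True), (False, False)]
print(round(consistency_score(pairs), 3))  # prints 0.667
```

Under this reading, a perfectly consistent model would score 1.0, since it would never miss an easy question whose harder counterpart it can solve.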

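The study proposes the ConsisEval benchmark to quantify the consistency of large language models and uses a relative consistency score to analyze the potential for improving consistency. Comprehensive experiments show that although GPT-4 achieves the highest consistency score, it is still inconsistent on specific questions, possibly because of distraction by redundant information, misinterpretation of questions, and similar factors. Models with stronger capabilities typically exhibit higher consistency, though exceptions exist, and hard data improves consistency for both fine-tuning and in-context learning.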

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Humans are accustomed to environments that contain both regularities and
exceptions. For example, at most gas stations, one pays prior to pumping, but
the occasional rural station does not accept payment in advance. Likewise, deep
neural networks can generalize across instances that share common patterns or
structures, yet have the capacity to memorize rare or irregular forms. We
analyze how individual instances are treated by a model via a consistency
score. The score characterizes the expected accuracy for a held-out instance
given training sets of varying size sampled from the data distribution. We
obtain empirical estimates of this score for individual instances in multiple
data sets, and we show that the score identifies out-of-distribution and
mislabeled examples at one end of the continuum and strongly regular examples
at the other end. We identify computationally inexpensive proxies to the
consistency score using statistics collected during training. We show examples
of potential applications to the analysis of deep-learning systems.
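
The definition above, expected accuracy on a held-out instance over training sets sampled from the data, can be approximated by repeated subsampling. The following is a minimal sketch under stated assumptions: a nearest-centroid classifier stands in for the deep network, a single training-set size is used, and the function names, toy data, and parameters are all hypothetical rather than taken from the paper.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, X_test):
    """Predict the class whose training centroid is closest (stand-in model)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def holdout_consistency(X, y, train_size, n_runs=200, seed=0):
    """Per-instance accuracy, averaged over the runs in which it was held out."""
    rng = np.random.default_rng(seed)
    n = len(X)
    correct = np.zeros(n)
    held_out = np.zeros(n)
    for _ in range(n_runs):
        train_idx = rng.choice(n, size=train_size, replace=False)
        test_mask = np.ones(n, dtype=bool)
        test_mask[train_idx] = False  # everything not sampled is held out
        preds = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_mask])
        correct[test_mask] += preds == y[test_mask]
        held_out[test_mask] += 1
    # Instances that were never held out get NaN instead of a misleading 0.
    return np.where(held_out > 0, correct / np.maximum(held_out, 1), np.nan)


# Toy data: two well-separated clusters, with instance 0 deliberately mislabeled.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1

scores = holdout_consistency(X, y, train_size=20)
print("mislabeled instance:", scores[0])
print("mean over other instances:", scores[1:].mean())
```

In this toy setup the mislabeled instance should land near the low end of the score range and the regular instances near the high end, which mirrors the continuum described in the abstract. Repeating the estimate for several values of `train_size` would give the score as a function of training-set size.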

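The paper analyzes how neural network models treat individual instances, using a consistency score to characterize the expected accuracy on each instance. It obtains empirical estimates of the score for individual instances in multiple data sets, using training sets of varying size sampled from the data distribution, and thereby characterizes how consistent each instance is with the patterns in the data. The approach may be applied to analyzing overfitting in deep-learning systems.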