Large language models (LLMs) have demonstrated impressive capabilities, but
still suffer from inconsistency issues (e.g., LLMs can react differently to
perturbations such as rephrasing or an inconsequential change in order). In addition to
these inconsistencies, we also observe that LLMs, while capable of solving hard
problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy
inconsistency, we develop the ConsisEval benchmark, where each entry comprises
a pair of questions with a strict order of difficulty. Furthermore, we
introduce the concept of a consistency score to quantitatively measure this
inconsistency, and we use a relative consistency score to analyze the potential
for improvement in consistency. Based on comprehensive experiments across a variety
of existing models, we find: (1) GPT-4 achieves the highest consistency score
of 92.2\% but is still inconsistent on specific questions due to distraction by
redundant information, misinterpretation of questions, etc.; (2) models with
stronger capabilities typically exhibit higher consistency, but exceptions also
exist; (3) hard data enhances consistency for both fine-tuning and in-context
learning. Our data and code will be publicly available on GitHub.

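The study proposes the ConsisEval benchmark to quantify the consistency of large language models and uses a relative consistency score to analyze the potential for improving consistency. Comprehensive experiments show that although GPT-4 achieves the highest consistency score, it is still inconsistent on specific questions, which may be caused by factors such as distraction by redundant information and misinterpretation of questions. Models with stronger capabilities typically exhibit higher consistency, though exceptions exist, and hard data improves consistency for both fine-tuning and in-context learning.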
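
To make the idea of a consistency score over easy/hard question pairs concrete, the following is a minimal sketch in Python. The abstract does not give the formula, so the definition used here is an assumption: the score is taken to be the fraction of pairs in which the easy question is answered correctly, among the pairs in which the hard question is answered correctly. The function name `consistency_score` and the `(easy_correct, hard_correct)` input format are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple


def consistency_score(results: Iterable[Tuple[bool, bool]]) -> Optional[float]:
    """Estimate hard-to-easy consistency from per-pair correctness flags.

    Each item is (easy_correct, hard_correct) for one question pair.
    ASSUMPTION: the score is the empirical rate of answering the easy
    question correctly, given that the hard question was answered correctly.
    This definition is not stated in the text above.
    """
    # Keep the easy-question outcome only for pairs whose hard question was solved.
    easy_given_hard = [easy for easy, hard in results if hard]
    if not easy_given_hard:
        # No hard question was solved, so the conditional rate is undefined.
        return None
    return sum(easy_given_hard) / len(easy_given_hard)


# Hypothetical outcomes for four question pairs: (easy_correct, hard_correct).
example = [(True, True), (False, True), (True, False), (True, True)]
score = consistency_score(example)
```

In this toy input the hard question is solved in three pairs and the easy question is also solved in two of those, so the sketch yields 2/3. The pair with a failed hard question does not count against the model under this assumed definition, because only hard-solved-but-easy-failed pairs represent the hard-to-easy inconsistency described above.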