Semantic consistency of a language model is broadly defined as the model's
ability to produce semantically equivalent outputs given
semantically equivalent inputs. We address the task of assessing
question-answering (QA) semantic consistency of contemporary large language
models (LLMs) by manually creating a benchmark dataset with high-quality
paraphrases for factual questions, and release the dataset to the community.
We further combine the semantic consistency metric with additional
measurements suggested in prior work as correlating with LLM QA accuracy to
build and evaluate a framework for reference-less factual QA performance
prediction: predicting the likelihood that a language model accurately
answers a question. Evaluating the framework on five contemporary LLMs, we
demonstrate encouraging results that significantly outperform baselines.
