Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. Despite their effectiveness, these models are prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generated content, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, where responses are typically brief and word-limited. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce \textsc{Luq}, a novel sampling-based UQ approach specifically designed for long text. Our findings reveal that \textsc{Luq} outperforms existing baseline methods in correlating with the model's factuality scores (a negative coefficient of -0.85 is observed for Gemini Pro). With \textsc{Luq} as the tool for UQ, we investigate the behavior patterns of several popular LLMs' response confidence spectra and how they interplay with the responses' factuality. We find that LLMs lack confidence in generating long text for rare facts, and that a factually strong model (e.g., GPT-4) tends to reject questions it is not sure about. To further improve the factual accuracy of LLM responses, we propose a method called \textsc{Luq-Ensemble} that ensembles responses from multiple models and selects the response with the least uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
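The selection step of \textsc{Luq-Ensemble} described above (pick, among several models' responses, the one with the least uncertainty) can be illustrated with a minimal sketch. It assumes each response already carries an uncertainty score from some UQ scorer where lower means more confident; the `Candidate` type, the function name, the model labels, and the numeric scores are all hypothetical placeholders for illustration, not the authors' implementation or reported values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One model's response to a question (hypothetical container)."""
    model: str          # illustrative model label, not a real identifier
    response: str       # the long-form answer text
    uncertainty: float  # assumed output of a UQ scorer; lower = more confident


def select_least_uncertain(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate whose uncertainty score is smallest."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return min(candidates, key=lambda c: c.uncertainty)


# Made-up scores, used only to show the selection behaviour.
candidates = [
    Candidate("model_a", "Response text from model A.", 0.42),
    Candidate("model_b", "Response text from model B.", 0.17),
    Candidate("model_c", "Response text from model C.", 0.63),
]
best = select_least_uncertain(candidates)
print(best.model)
```

In this sketch ties are resolved in favour of the first candidate listed, since `min` returns the first minimal element; how ties should be handled in practice is left open here.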

LUQ: Long-text Uncertainty Quantification for LLMs