Previous works on Large Language Models (LLMs) have mainly focused on
evaluating their helpfulness or harmlessness. However, honesty, another crucial
alignment criterion, has received relatively less attention. Dishonest
behaviors in LLMs, such as spreading misinformation and defrauding users,
erode user trust, cause real-world harm, and present severe risks that
intensify as these models approach superintelligence. Enhancing honesty
in LLMs addresses critical deficiencies and helps uncover latent capabilities
that are not readily expressed. This underscores the urgent need for reliable
methods and benchmarks to effectively ensure and evaluate the honesty of LLMs.
In this paper, we introduce BeHonest, a pioneering benchmark specifically
designed to assess honesty in LLMs comprehensively. BeHonest evaluates three
essential aspects of honesty: awareness of knowledge boundaries, avoidance of
deceit, and consistency in responses. Building on this foundation, we designed
10 scenarios to evaluate and analyze 9 popular LLMs on the market, including
both closed-source and open-source models from different model families with
varied model sizes. Our findings indicate that there is still significant room
for improvement in the honesty of LLMs. We also encourage the AI community to
prioritize honesty alignment in LLMs. Our benchmark and code can be found at:
https://github.com/GAIR-NLP/BeHonest.

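This paper introduces BeHonest, a new benchmark designed to comprehensively evaluate the honesty of large language models (LLMs). It highlights the real-world impact of LLM honesty and the urgent need for reliable methods and benchmarks to ensure and evaluate it.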

BeHonest: Benchmarking Honesty of Large Language Models

Large Language Models (LLMs) have achieved remarkable success across various
industries due to their exceptional generative capabilities. However, for safe
and effective real-world deployments, ensuring honesty and helpfulness is
critical. This paper addresses the question: Can we prioritize the helpfulness
of LLMs while preserving their honesty? To begin with, we establish exhaustive
principles aimed at guaranteeing the honesty of LLMs. Additionally, we introduce
a novel dataset, referred to as HoneSet, comprising 930 queries spanning six
categories meticulously crafted to assess an LLM's capacity for maintaining
honesty. Subsequently, we present two approaches to augmenting honesty and
helpfulness in LLMs: a training-free enhancement and a fine-tuning-based
improvement. The training-free approach, which is based on curiosity-driven
prompting, empowers LLMs to articulate internal confusion and uncertainty
regarding queries, thereby optimizing their responses. Conversely, the
fine-tuning-based method employs a two-stage process inspired by curriculum
learning: initially instructing LLMs to discern between honest and dishonest
responses, then refining their training to enhance helpfulness. Experiments
conducted on nine prominent LLMs demonstrate a significant improvement in
alignment with honesty across all models through the implementation of our
proposed enhancements. Particularly noteworthy is the 65.3% enhancement
observed in Llama3-8b and the remarkable 124.7% improvement in Mistral-7b, as
measured by the H$^{2}$ (honest and helpful) assessment. We believe that our
work can pave the way for developing more trustworthy LLMs for real-world
applications.

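This paper describes how to improve the real-world performance of large language models by ensuring both honesty and helpfulness: it establishes principles for honesty, introduces a dataset for evaluation, and proposes two methods for enhancing honesty and helpfulness. Experiments show that these methods significantly improve the honesty and helpfulness of LLMs and may lay the groundwork for more trustworthy language models in real-world applications.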

The Best of Both Worlds: Toward an Honest and Helpful Large Language Model

Recent research has made significant strides in applying alignment techniques
to enhance the helpfulness and harmlessness of large language models (LLMs) in
accordance with human intentions. In this paper, we argue for the importance of
alignment for honesty, ensuring that LLMs proactively refuse to answer
questions when they lack knowledge, while still not being overly conservative.
However, a pivotal aspect of alignment for honesty involves discerning the
limits of an LLM's knowledge, which is far from straightforward. This challenge
demands comprehensive solutions in terms of metric development, benchmark
creation, and training methodologies. In this paper, we address these
challenges by first establishing a precise problem definition and defining
"honesty" inspired by the Analects of Confucius. This serves as a cornerstone
for developing metrics that effectively measure an LLM's honesty by quantifying
its progress post-alignment. Furthermore, we introduce a flexible training
framework which is further instantiated by several efficient fine-tuning
techniques that emphasize honesty without sacrificing performance on other
tasks. Our extensive experiments reveal that these aligned models show a marked
increase in honesty, as indicated by our proposed metrics. We open-source a
wealth of resources to facilitate future research at
this https URL, including honesty-aligned
models, training and evaluation datasets for honesty alignment, concept
glossary, as well as all relevant source code.

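Alignment techniques have been applied to make large language models (LLMs) more helpful and harmless; this paper argues that alignment for honesty is equally important, so that LLMs proactively refuse to answer questions when they lack knowledge without being overly conservative. To address the challenge of identifying the limits of an LLM's knowledge, the paper establishes a precise problem definition and a definition of "honesty" inspired by the Analects of Confucius, and introduces a flexible training framework with several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Measured by the proposed metrics, the aligned models show a marked increase in honesty.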