Large language models (LLMs), such as ChatGPT, have rapidly become part of people's work and daily lives over the past few years, owing to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in history in terms of user numbers and an important foundation model for the next generation of artificial intelligence applications. However, the content generated by LLMs is not entirely reliable: it often contains factual errors, biases, and toxicity. Given the vast number of users and the wide range of application scenarios, these unreliable responses can have serious negative impacts. This thesis presents exploratory work on language model reliability conducted during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, which evaluate factual knowledge and logical reasoning accuracy, respectively. Second, to assess the non-toxicity of LLMs, we introduce two works on red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, which measure the social bias and cultural bias of LLMs, respectively.

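This study addresses the reliability of large language models (LLMs) with respect to correctness, non-toxicity, and fairness. It introduces two testing frameworks, FactChecker and LogicAsker, to evaluate the factual knowledge and logical reasoning accuracy of LLMs, and uses the BiasAsker and XCulturalBench frameworks to measure social bias and cultural bias. The findings indicate that improving the accuracy and fairness of LLMs is essential to their safety and effectiveness across a wide range of applications.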

Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness