Large language models (LLMs) are typically evaluated on task-based benchmarks such as MMLU. Such benchmarks do not examine whether LLMs behave responsibly in specific contexts. This is particularly true in the LGBTI+ context, where social stereotypes may result in variation in LGBTI+ terminology. Domain-specific lexicons or dictionaries may therefore be useful as a representative list of words against which an LLM's behaviour can be evaluated. This paper presents a methodology for evaluating LLMs using an LGBTI+ lexicon in Indian languages. The methodology consists of four steps: formulating NLP tasks relevant to the expected behaviour, creating prompts that test the LLMs, using the LLMs to obtain the output and, finally, manually evaluating the results. Our qualitative analysis shows that the three LLMs we experiment on are unable to detect underlying hateful content. We also observe limitations in using machine translation as a means to evaluate natural language understanding in languages other than English. The methodology presented in this paper can be applied to LGBTI+ lexicons in other languages as well as to other domain-specific lexicons. The work opens avenues for responsible behaviour of LLMs, as demonstrated in the context of the prevalent social perception of the LGBTI+ community.
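The four steps above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not the authors' implementation: the task names, prompt templates, field names, the `query_model` callable and the placeholder lexicon entries are all hypothetical, and the final step only prepares a sheet for human annotators because the evaluation itself is manual.

```python
import csv
import io
from typing import Callable

# Step 1 (assumed): NLP tasks expressed as prompt templates.
# These templates are hypothetical; the paper's actual prompts are not given here.
TASK_TEMPLATES = {
    "explain_term": "What does the word '{term}' mean in {language}?",
    "translate": "Translate the {language} word '{term}' into English.",
}

FIELDS = ["term", "language", "task", "prompt", "output", "annotator_label"]


def build_prompts(lexicon: list[dict], templates: dict[str, str]) -> list[dict]:
    """Step 2: create one prompt per (lexicon entry, task) pair."""
    prompts = []
    for entry in lexicon:
        for task, template in templates.items():
            prompts.append({
                "term": entry["term"],
                "language": entry["language"],
                "task": task,
                "prompt": template.format(term=entry["term"],
                                          language=entry["language"]),
            })
    return prompts


def collect_outputs(prompts: list[dict],
                    query_model: Callable[[str], str]) -> list[dict]:
    """Step 3: obtain the model output for every prompt.

    `query_model` is a placeholder for whatever LLM client is used.
    """
    return [
        {**p, "output": query_model(p["prompt"]), "annotator_label": ""}
        for p in prompts
    ]


def to_annotation_csv(records: list[dict]) -> str:
    """Step 4 (preparation only): write a sheet with an empty label column
    that human annotators fill in during manual evaluation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder entries; a real run would load the LGBTI+ lexicon instead.
    lexicon = [{"term": "<lexicon term>", "language": "<language>"}]
    # Stub model so the sketch runs without any external service.
    records = collect_outputs(build_prompts(lexicon, TASK_TEMPLATES),
                              lambda prompt: "<model output>")
    print(to_annotation_csv(records))
```

Keeping the model behind a plain callable means the same prompts can be sent to each of the LLMs under study, and leaving `annotator_label` empty keeps the judgement of hateful or incorrect output with human evaluators.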

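This paper proposes a method for evaluating large language models using an Indian-language LGBTI+ lexicon, in four steps: identifying natural language processing tasks relevant to the expected behaviour, creating prompts that test the language models, using the language models to obtain output, and evaluating the results manually. Through qualitative analysis, we find that the three language models in our experiments are unable to detect underlying hateful content, and that using machine translation to evaluate natural language understanding in non-English languages has limitations. The proposed method is useful for LGBTI+ lexicons in other languages as well as for other domain-specific lexicons. The work opens avenues for responsible behaviour of large language models, as shown in the context of the prevalent social perception of the LGBTI+ community.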

Evaluation of Large Language Models Using an Indian Language LGBTI+ Lexicon