Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.

本研究评估了多语种大型语言模型的性能，发现GPT-4o和Llama-3 70B模型在大多数Indic语言中表现最佳。我们构建了两个评估设置的排行榜，并分析了人类评估和语言模型评估之间的一致性，发现在两两比较的设置下，人类和语言模型的一致性较高，但在直接评估中特别是对于孟加拉语和奥迪亚语等语言，一致性下降。我们还检测了人类和语言模型评估中的各种偏见，并发现GPT评估器存在自我偏见。本研究对多语种大型语言模型的评估具有重要意义。

PARIKSHA：多语言和跨文化数据上人类LLM评估者一致性的大规模调查