Current evaluation benchmarks for question answering (QA) in Indic languages often rely on machine translation of existing English datasets. This approach inherits the biases and inaccuracies of machine translation, yielding datasets that may not reflect the true capabilities of extractive QA (EQA) models for Indic languages. This paper proposes a new benchmark designed specifically for evaluating Hindi EQA models and discusses how the same methodology can be applied to other tasks. The method uses large language models (LLMs) to generate a high-quality dataset in an extractive setting, ensuring its relevance to the target language. We believe this new resource will foster advances in Hindi NLP research by providing a more accurate and reliable evaluation tool.

Suvach -- Generated Hindi QA Benchmark