Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulty of collection and manual annotation. As a result, both developing new models and measuring the performance of multilingual LLMs in low-resource languages are challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We use parallel content mining to obtain $\textit{human-curated}$ paragraphs in both English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. Human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of poor-quality data. We use the dataset to benchmark state-of-the-art LLMs, showing that they fall short of human accuracy, with some models performing close to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.
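As a rough illustration of the pipeline described above, the following Python sketch wires the stages together: generate MC questions from the English paragraph, translate them, validate the translation, and pair the surviving questions with the human-curated target-language paragraph. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `MCQuestion`, `QASample`, `build_dataset`, and the three callables `generate_qa`, `translate`, and `is_valid` are hypothetical placeholders, and the actual generation, translation, and validation criteria are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class MCQuestion:
    """A multiple-choice question (hypothetical structure for illustration)."""
    question: str
    options: List[str]
    answer_index: int


@dataclass
class QASample:
    """One dataset entry: target-language paragraph plus translated question."""
    context: str
    qa: MCQuestion


def build_dataset(
    parallel_paragraphs: Iterable[Tuple[str, str]],
    generate_qa: Callable[[str], List[MCQuestion]],
    translate: Callable[[MCQuestion], MCQuestion],
    is_valid: Callable[[MCQuestion, MCQuestion], bool],
) -> List[QASample]:
    """Assemble QA samples from (English, target-language) paragraph pairs.

    All three callables are assumptions standing in for an LLM-based
    generator, a machine translation step, and a translation quality check.
    """
    dataset: List[QASample] = []
    for english_paragraph, target_paragraph in parallel_paragraphs:
        # Questions are generated from the English side only.
        for qa_en in generate_qa(english_paragraph):
            qa_target = translate(qa_en)
            # Keep only translations that pass the validation step.
            if is_valid(qa_en, qa_target):
                # The context is the human-curated target-language paragraph,
                # not a machine translation of the English one.
                dataset.append(QASample(context=target_paragraph, qa=qa_target))
    return dataset
```

The stages are passed in as callables so that each one can be swapped independently; only the question-answer pairs go through automatic translation, while the context paragraphs stay human-curated.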

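This work proposes SynDARin, a method for generating and validating question answering datasets for low-resource languages. The method uses parallel content mining to obtain human-curated paragraphs in English and the target language, generates synthetic multiple-choice question-answer pairs using the English data as context, and then automatically translates them and validates their quality. Human evaluation shows that 98% of the generated English data maintains quality and diversity in question types and topics, and that the translation validation pipeline can filter out about 70% of poor-quality data. Benchmarking state-of-the-art LLMs on the dataset shows that they cannot reach human accuracy, with some models performing close to random chance. This indicates that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.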

SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages