Dialect identification is a critical task in speech processing and language technology, benefiting applications such as speech recognition and speaker verification, among others. While most research has been dedicated to dialect identification in widely spoken languages, limited attention has been given to low-resource languages, such as Romanian. To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset comprises a varied collection of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.

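Dialect identification is a key task in speech processing and language technology, but research has focused mainly on widely spoken languages, with little work on low-resource languages such as Romanian. To fill this research gap, we introduce RoDia, the first dataset for Romanian dialect identification, comprising samples from five distinct regions of Romania and totaling 2 hours of manually annotated speech data. We also provide a set of competitive models as baselines for future research. On this dataset, the top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, showing that the task is challenging. We therefore believe that RoDia is a valuable resource that will stimulate research on the challenges of Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.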

RoDia: A New Dataset for Romanian Dialect Identification from Speech