BriefGPT.xyz
Sep, 2023
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo...
TL;DR
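CulturaX provides a multilingual dataset for large language models that has undergone rigorous cleaning and deduplication. It addresses transparency, hallucination, and bias issues in LLM development and promotes research and development of multilingual LLMs.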
Abstract
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural...