We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained from scratch on a large corpus. VBART models are compact LLMs that build on ideas from the BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering, and question generation. They can be fine-tuned for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that an LLM pre-trained for Turkish outperforms multilingual models up to 3x its size, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is 7x more efficient than OpenAI's multilingual tokenizer. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla scaling law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer, and cleaned 135 GB web corpus are publicly available at huggingface.co/vngrs-ai.

VBART: The Turkish LLM