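TL;DR: LR-Sum is a dataset for multilingual news summarization, created under a Creative Commons license and covering 40 languages, that aims to promote further research on automatic summarization for less-resourced languages. This paper describes how the dataset was extracted and filtered from the Multilingual Open Text corpus, and discusses the dataset's limitations and uses.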
Abstract
This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced