The success of Natural Language Understanding (NLU) benchmarks in various
languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and
IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across
a wide range of tasks. To establish a standardized set of benchmarks for
Vietnamese NLU, we introduce the first Vietnamese Language Understanding
Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets
covering different NLU tasks, including text classification, span extraction,
and natural language understanding. To provide an insightful overview of the
current state of Vietnamese NLU, we then evaluate seven state-of-the-art
pre-trained models, including both multilingual and Vietnamese monolingual
models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new
state-of-the-art pre-trained model that achieves superior results across all
tasks in the VLUE benchmark. Our model combines the proficiency of a
multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT
is developed based on the XLM-RoBERTa model, with an additional pretraining
step utilizing a significant amount of Vietnamese textual data to enhance its
adaptation to the Vietnamese language. CafeBERT is made publicly available to
support future research.

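To evaluate new natural language understanding models across a range of tasks, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark, which comprises five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. We evaluate seven state-of-the-art pre-trained models, both multilingual and Vietnamese monolingual, on the proposed VLUE benchmark, and present CafeBERT, a new pre-trained model that achieves strong results on all tasks in the benchmark.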

VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding

Recent advances in vision-language pre-training (VLP) have demonstrated
impressive performance in a range of vision-language (VL) tasks. However, there
exist several challenges for measuring the community's progress in building
general multi-modal intelligence. First, most of the downstream VL datasets are
annotated using raw images that are already seen during pre-training, which may
result in an overestimation of current VLP models' generalization ability.
Second, recent VLP work mainly focuses on absolute performance but overlooks
the efficiency-performance trade-off, which is also an important indicator for
measuring progress.
To this end, we introduce the Vision-Language Understanding Evaluation (VLUE)
benchmark, a multi-task multi-dimension benchmark for evaluating the
generalization capabilities and the efficiency-performance trade-off ("Pareto
SOTA") of VLP models. We demonstrate that there is a sizable generalization
gap for all VLP models when testing on out-of-distribution test sets annotated
on images from a more diverse distribution that spreads across cultures.
Moreover, we find that measuring the efficiency-performance trade-off of VLP
models leads to complementary insights for several design choices of VLP. We
release the VLUE benchmark to promote research on building vision-language
models that generalize well to more diverse images and concepts unseen during
pre-training, and are practical in terms of efficiency-performance trade-off.
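
The "Pareto SOTA" idea above can be illustrated with a minimal sketch, assuming each model is summarized by one efficiency number (lower is better) and one performance number (higher is better). The `ModelResult` fields, the `pareto_front` helper, and all numbers below are hypothetical and chosen for illustration; they are not the benchmark's actual metrics or results.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    latency_ms: float  # hypothetical efficiency measure, lower is better
    accuracy: float    # hypothetical performance measure, higher is better


def dominates(a: ModelResult, b: ModelResult) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    strictly_better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(results: List[ModelResult]) -> List[ModelResult]:
    """Return the models that no other model dominates, sorted by latency."""
    front = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(front, key=lambda r: r.latency_ms)


# Made-up numbers for illustration only; not results from the paper.
results = [
    ModelResult("model_a", 10.0, 70.0),
    ModelResult("model_b", 20.0, 75.0),
    ModelResult("model_c", 25.0, 74.0),  # slower and less accurate than model_b
    ModelResult("model_d", 40.0, 80.0),
]

front = pareto_front(results)
print([m.name for m in front])  # ['model_a', 'model_b', 'model_d']
```

Under this reading, a model counts as "Pareto SOTA" when it lies on the front: `model_c` is excluded because `model_b` is both faster and more accurate, while `model_a` stays on the front despite having the lowest accuracy because nothing is faster.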

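This work introduces VLUE, a vision-language understanding evaluation benchmark for assessing the generalization ability and the efficiency-performance trade-off of VLP models. The benchmark reveals a sizable generalization gap for all VLP models on images that come from a more culturally diverse distribution and were not seen during pre-training, and shows that measuring the efficiency-performance trade-off yields useful insights for VLP design choices.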