This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a trade-off between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
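
The abstract does not spell out the contrastive objective. As a rough illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives, a common choice for contrastive pre-training on text pairs. The function name `info_nce_loss`, the plain-list vector format, and the temperature value are hypothetical and are not taken from the report.

```python
import math


def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch, using in-batch negatives.

    query_vecs[i] and passage_vecs[i] form a positive pair; every other
    passage in the batch acts as a negative for query i.
    The temperature default is illustrative, not a value from the report.
    """
    def normalize(vec):
        # Scale to unit length so the dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(v) for v in query_vecs]
    passages = [normalize(v) for v in passage_vecs]

    total = 0.0
    for i, query in enumerate(queries):
        # Temperature-scaled cosine similarity to every passage in the batch.
        logits = [
            sum(a * b for a, b in zip(query, passage)) / temperature
            for passage in passages
        ]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability of the matching passage.
        total += log_z - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

Under this assumed objective, the loss is small when each query is closest to its own passage and grows when the pairs are mismatched, which is the behavior the pre-training stage relies on.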

Multilingual E5 Text Embeddings: A Technical Report