In this paper, we introduce a novel and simple method for obtaining
high-quality text embeddings using only synthetic data and fewer than 1k
training steps. Unlike existing methods that often depend on multi-stage
intermediate pre-training with billions of weakly-supervised text pairs,
followed by fine-tuning with a few labeled datasets, our method does not
require building complex training pipelines or relying on manually collected
datasets that are often constrained by task diversity and language coverage. We
leverage proprietary LLMs to generate diverse synthetic data for hundreds of
thousands of text embedding tasks across nearly 100 languages. We then
fine-tune open-source decoder-only LLMs on the synthetic data using a standard
contrastive loss. Experiments demonstrate that our method achieves strong
performance on highly competitive text embedding benchmarks without using any
labeled data. Furthermore, when fine-tuned with a mixture of synthetic and
labeled data, our model sets new state-of-the-art results on the BEIR and MTEB
benchmarks.
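
The abstract does not spell out the "standard contrastive loss". A common choice for embedding fine-tuning is InfoNCE with in-batch negatives, so the following is a minimal sketch under that assumption. The function name `info_nce_loss`, the use of cosine similarity, and the default temperature are illustrative assumptions, not details taken from the paper.

```python
import math


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """InfoNCE loss with in-batch negatives (illustrative sketch).

    query_embs[i] and doc_embs[i] form a positive pair; every other
    document in the batch is treated as a negative for query i.
    The temperature default is an assumed value, not one from the paper.
    """
    def normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalise(q) for q in query_embs]
    docs = [normalise(d) for d in doc_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity to every document, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching document.
        total += log_denom - logits[i]
    return total / len(queries)
```

The loss is close to zero when each query is most similar to its own document, and it grows as negatives outscore the positive.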

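We introduce a novel and simple method for obtaining high-quality text embeddings using synthetic data and fewer than 1k training steps. Unlike existing methods, our method does not require building complex training pipelines or relying on manually collected datasets, which are often constrained by task diversity and language coverage. We use proprietary LLMs to generate large amounts of diverse synthetic data across nearly 100 languages, and then fine-tune open-source decoder-only LLMs on the synthetic data with a standard contrastive loss. Experiments show that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned on a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.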

Improving Text Embeddings with Large Language Models

In this paper we report on our submission to the Multidocument Summarisation
for Literature Review (MSLR) shared task. Specifically, we adapt PRIMERA (Xiao
et al., 2022) to the biomedical domain by placing global attention on important
biomedical entities in several ways. We analyse the outputs of the 23 resulting
models, and report patterns in the results related to the presence of
additional global attention, number of training steps, and the input
configuration.
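
The abstract does not say how entities are located or which of the "several ways" of placing global attention were tried. As a rough sketch only: in Longformer-style models, global attention is usually specified as a 0/1 mask over token positions. The helper below builds such a mask from entity spans. The name `build_global_attention_mask`, the half-open span convention, and the choice to keep position 0 global are all assumptions made for illustration.

```python
def build_global_attention_mask(num_tokens, entity_spans, always_global=(0,)):
    """Return a 0/1 mask marking tokens that receive global attention.

    entity_spans: iterable of (start, end) token indices, half-open,
    assumed to come from some external biomedical entity tagger.
    always_global: positions kept global regardless of entities
    (position 0 here is an assumed default).
    """
    mask = [0] * num_tokens
    for idx in always_global:
        if 0 <= idx < num_tokens:
            mask[idx] = 1
    for start, end in entity_spans:
        # Clip each span to the sequence so truncated inputs stay valid.
        for idx in range(max(start, 0), min(end, num_tokens)):
            mask[idx] = 1
    return mask
```

For example, with 8 tokens and entity spans (2, 4) and (6, 10), positions 0, 2, 3, 6 and 7 are marked global, and the part of the second span beyond the sequence end is ignored.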

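This paper adapts PRIMERA to the biomedical domain by placing global attention on several kinds of important biomedical entities, and analyses the outputs of the 23 resulting models. The results show patterns related to the presence of global attention, the number of training steps, and the input configuration.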