Text embedding methods have become increasingly popular in both industrial
and academic fields due to their critical role in a variety of natural language
processing tasks. The significance of universal text embeddings has been
further highlighted by the rise of Large Language Model (LLM) applications
such as Retrieval-Augmented Systems (RAGs). While previous models have
attempted to be general-purpose, they often struggle to generalize across tasks
and domains. However, recent advances in the quantity, quality, and diversity
of training data, in synthetic data generation with LLMs, and in the use of LLMs
as backbones have driven substantial progress toward universal text embeddings.
In this paper, we provide an overview of recent advances in universal text
embedding models, with a focus on the top-performing text embeddings on the
Massive Text Embedding Benchmark (MTEB). Through detailed comparison and
analysis, we highlight the key contributions and limitations in this area, and
propose promising directions for future research.

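Through a detailed comparison and analysis of the top-performing text embeddings on the Massive Text Embedding Benchmark, this paper surveys recent advances in universal text embedding models, highlights the key contributions and limitations in the area, and proposes directions for future research.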

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

We present Gecko, a compact and versatile text embedding model. Gecko
achieves strong retrieval performance by leveraging a key idea: distilling
knowledge from large language models (LLMs) into a retriever. Our two-step
distillation process begins with generating diverse, synthetic paired data
using an LLM. Next, we further refine the data quality by retrieving a set of
candidate passages for each query, and relabeling the positive and hard
negative passages using the same LLM. The effectiveness of our approach is
demonstrated by the compactness of Gecko. On the Massive Text Embedding
Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing
entries with an embedding size of 768. Gecko with 768 embedding dimensions
achieves an average score of 66.31, competing with models that are 7x larger
and embeddings with 5x higher dimensionality.
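
The two-step process described above can be sketched roughly as follows. This is a minimal illustration, not Gecko's implementation: the three callables are hypothetical stand-ins for the LLM and a first-stage retriever, the word-overlap scorer replaces a real LLM judgment, and the rule "best-scored candidate is the positive, worst-scored is the hard negative" is an assumption, since the abstract does not state how the passages are selected.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the paper): stand-ins for the LLM and
# for a first-stage retriever.
GenerateQuery = Callable[[str], str]        # passage -> synthetic query
Retrieve = Callable[[str, int], List[str]]  # (query, k) -> candidate passages
Score = Callable[[str, str], float]         # (query, passage) -> relevance


def build_training_example(
    seed_passage: str,
    generate_query: GenerateQuery,
    retrieve: Retrieve,
    score: Score,
    k: int = 3,
) -> Dict[str, str]:
    # Step 1: generate a synthetic query for the seed passage.
    query = generate_query(seed_passage)
    # Step 2: retrieve candidates, then relabel them by relevance score.
    candidates = retrieve(query, k)
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to pick a negative")
    ranked = sorted(candidates, key=lambda p: score(query, p), reverse=True)
    # Assumption: the best-scored candidate becomes the positive and the
    # worst-scored one the hard negative.
    return {"query": query, "positive": ranked[0], "hard_negative": ranked[-1]}


# Toy stand-ins so the sketch runs without any model.
corpus = [
    "gecko is a compact text embedding model",
    "a gecko is a small lizard",
    "mteb is a benchmark for text embeddings",
]


def toy_generate_query(passage: str) -> str:
    # A real system would prompt an LLM; here we reuse the first three words.
    return " ".join(passage.split()[:3])


def overlap(query: str, passage: str) -> float:
    # Number of distinct words shared by query and passage.
    return float(len(set(query.split()) & set(passage.split())))


def toy_retrieve(query: str, k: int) -> List[str]:
    return sorted(corpus, key=lambda p: overlap(query, p), reverse=True)[:k]


example = build_training_example(corpus[2], toy_generate_query, toy_retrieve, overlap)
print(example)
```

Note that the relabeled positive need not be the seed passage: if another candidate scores higher for the generated query, it takes the positive slot, which is the point of relabeling after retrieval.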

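We present Gecko, a compact and versatile text embedding model that achieves strong retrieval performance by distilling knowledge from large language models (LLMs) into a retriever.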

Gecko: Versatile Text Embeddings Distilled from Large Language Models

We introduce a novel suite of state-of-the-art bilingual text embedding
models that are designed to support English and another target language. These
models are capable of processing lengthy text inputs with up to 8192 tokens,
making them highly versatile for a range of natural language processing tasks
such as text retrieval, clustering, and semantic textual similarity (STS)
calculations.
By focusing on bilingual models and introducing a unique multi-task learning
objective, we have significantly improved model performance on STS tasks,
outperforming existing multilingual models in both target language
understanding and cross-lingual evaluation tasks. Moreover, our
bilingual models are more efficient, requiring fewer parameters and less memory
due to their smaller vocabulary needs. Furthermore, we have expanded the
Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and
Spanish embedding models. This integration aims to stimulate further research
and advancement in text embedding technologies for these languages.

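This work introduces a novel suite of bilingual text embedding models that handle text inputs of up to 8192 tokens and support English plus one target language, making them suitable for natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculation. By focusing on bilingual models and introducing a unique multi-task learning objective, the work improves performance on STS tasks and outperforms existing multilingual models in both target language understanding and cross-lingual evaluation. The bilingual models are also more efficient, requiring fewer parameters and less memory because of their smaller vocabulary needs. The work further extends the Massive Text Embedding Benchmark (MTEB) with benchmarks for German and Spanish embedding models, aiming to stimulate further research and progress in text embedding for these languages.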

Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

Text embeddings are commonly evaluated on a small set of datasets from a
single task, which does not cover their possible applications to other tasks. It is
unclear whether state-of-the-art embeddings on semantic textual similarity
(STS) can be equally well applied to other tasks like clustering or reranking.
This makes progress in the field difficult to track, as various models are
constantly being proposed without proper evaluation. To solve this problem, we
introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding
tasks covering a total of 58 datasets and 112 languages. Through the
benchmarking of 33 models on MTEB, we establish the most comprehensive
benchmark of text embeddings to date. We find that no particular text embedding
method dominates across all tasks. This suggests that the field has yet to
converge on a universal text embedding method and scale it up sufficiently to
provide state-of-the-art results on all embedding tasks. MTEB comes with
open-source code and a public leaderboard at
this https URL.
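
The finding that no single method dominates across all tasks can be made concrete with a small check over a per-task score table. The sketch below is illustrative only: the model names and scores are invented for the example and are not MTEB results, and the helper functions are hypothetical rather than part of the MTEB code base.

```python
from typing import Dict, Optional

Scores = Dict[str, Dict[str, float]]  # model -> task -> score


def best_per_task(scores: Scores) -> Dict[str, str]:
    # For each task, return the model with the highest score.
    tasks = next(iter(scores.values())).keys()
    return {t: max(scores, key=lambda m: scores[m][t]) for t in tasks}


def dominant_model(scores: Scores) -> Optional[str]:
    # A model dominates only if it is the best on every task.
    winners = set(best_per_task(scores).values())
    return winners.pop() if len(winners) == 1 else None


# Invented numbers, for illustration only.
toy_scores: Scores = {
    "model_a": {"sts": 80.0, "clustering": 40.0, "reranking": 55.0},
    "model_b": {"sts": 75.0, "clustering": 45.0, "reranking": 50.0},
}

print(best_per_task(toy_scores))
print(dominant_model(toy_scores))
```

In this toy table the model that leads on STS loses on clustering, so `dominant_model` returns `None`. That is the situation the abstract describes: a strong result on one task type does not settle performance on the others.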

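This paper introduces the Massive Text Embedding Benchmark, which evaluates 33 models across 8 embedding tasks and 112 languages. It finds that no single embedding method dominates across all tasks, so further research on universal text embedding methods is needed.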