Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
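The three steps named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`attention_mask`, `mntp_targets`, `contrastive_loss`), the temperature value, and the toy vectors are all hypothetical. It also assumes two details the abstract does not state: that a masked token at position i is predicted from the output at position i - 1, and that the contrastive objective is an InfoNCE-style loss with in-batch negatives over two views of each sequence.

```python
import math


def attention_mask(seq_len, bidirectional):
    """Step 1: entry [i][j] == 1 means token i may attend to token j.

    A causal (decoder) mask only allows j <= i; the bidirectional
    mask allows every pair.
    """
    return [[1 if (bidirectional or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def mntp_targets(token_ids, masked_positions, mask_id):
    """Step 2: mask the chosen tokens and record where each is predicted.

    Assumption: the token masked at position i is predicted from the
    model output at position i - 1, so position 0 is never masked.
    Returns the masked inputs and (output_position, target_token) pairs.
    """
    inputs = list(token_ids)
    targets = []
    for i in masked_positions:
        if i == 0:
            continue  # no previous position to predict from
        inputs[i] = mask_id
        targets.append((i - 1, token_ids[i]))
    return inputs, targets


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.05):
    """Step 3: InfoNCE-style loss with in-batch negatives (an assumption).

    view_a[i] and view_b[i] are two embeddings of the same sequence;
    every view_b[j] with j != i serves as a negative for view_a[i].
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


causal = attention_mask(3, bidirectional=False)
full = attention_mask(3, bidirectional=True)
masked_inputs, targets = mntp_targets([10, 11, 12, 13], [2], mask_id=0)
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each embedding matches its own second view and much larger when the views are swapped, which is the behaviour a contrastive objective is meant to reward.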

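We introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only model into a strong text encoder in three simple steps: enabling bidirectional attention, masked next token prediction, and unsupervised contrastive learning. On English word- and sequence-level tasks, LLM2Vec outperforms encoder-only models by a large margin on word-level tasks and sets a new unsupervised state of the art on the Massive Text Embeddings Benchmark (MTEB). Combined with supervised contrastive learning, it achieves state-of-the-art performance on MTEB among models trained only on publicly available data.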

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders