Code language models have recently made notable progress on a wide range of code comprehension and generation tasks. Yet the code embeddings of multilingual code models remain poorly understood. In this paper, we present a comprehensive study of multilingual code embeddings, focusing on their cross-lingual capabilities across programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one tied to the syntax and nuances of a specific language, and another that is agnostic to these details and primarily captures semantics. Further, we show that isolating and removing the language-specific component yields significant improvements in downstream code retrieval tasks, with an absolute increase of up to +17 in Mean Reciprocal Rank (MRR).
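To make the idea of removing a language-specific component concrete, here is a minimal sketch. It assumes that component can be approximated by the mean embedding of each programming language, which is then subtracted from every embedding of that language. This centering estimator and the function name `remove_language_mean` are illustrative assumptions; the abstract does not state which estimator the study uses.

```python
import numpy as np


def remove_language_mean(embeddings, languages):
    """Subtract each language's mean embedding from its members.

    Assumption: the language-specific component is approximated by the
    per-language centroid. This is an illustrative choice, not necessarily
    the estimator used in the study.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    languages = np.asarray(languages)
    centered = np.empty_like(embeddings)
    for lang in np.unique(languages):
        mask = languages == lang  # rows written in this language
        centered[mask] = embeddings[mask] - embeddings[mask].mean(axis=0)
    return centered


# Toy data: two "python" vectors and two "java" vectors with different offsets.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]])
langs = ["python", "python", "java", "java"]
centered = remove_language_mean(emb, langs)
```

After centering, the embeddings of each language have zero mean, so a constant offset shared by all snippets of one language no longer affects cross-lingual similarity.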

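By analyzing the code embeddings of a growing set of cross-lingual code models, this study shows that code embeddings comprise two distinct components: one closely tied to the nuances and syntax of a specific language, and another that is agnostic to such details and focuses primarily on semantics. We further show that removing the language-specific component significantly improves downstream code retrieval, with an absolute gain of up to +17 in Mean Reciprocal Rank (MRR).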
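For reference, MRR in a retrieval setting can be computed as in the sketch below. It assumes a paired setup in which the correct candidate for query `i` is candidate `i`, and it ranks candidates by cosine similarity. The function name `mean_reciprocal_rank` and this pairing convention are assumptions for illustration, not details taken from the study.

```python
import numpy as np


def mean_reciprocal_rank(query_emb, cand_emb):
    """MRR for paired retrieval: query i should retrieve candidate i.

    Candidates are ranked by cosine similarity (an assumed scoring choice).
    """
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(cand_emb, dtype=float)
    # L2-normalize rows so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = q @ c.T  # sims[i, j] = similarity of query i and candidate j
    correct = np.diag(sims)[:, None]  # score of the true match for each query
    # Rank = 1 + number of candidates scoring strictly above the true match.
    ranks = 1 + (sims > correct).sum(axis=1)
    return float((1.0 / ranks).mean())


# Toy example: each query is closest to its own candidate, so every rank is 1.
queries = np.eye(3)
candidates = np.eye(3)
score = mean_reciprocal_rank(queries, candidates)
```

The function returns a value in (0, 1]. A gain quoted as +17 is presumably expressed on a 0 to 100 scale, that is, with MRR multiplied by 100.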

Language-Agnostic Code Embeddings