Determining which Large Language Model (LLM) is most effective for code smell detection is a complex challenge. This study introduces a structured methodology and evaluation matrix to address it, using a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and performance on individual code smell types. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
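
To make the evaluation metrics concrete, the following minimal sketch shows how precision, recall, and F1 score could be computed overall and per code smell type from annotated samples. It is an illustration under stated assumptions, not the study's evaluation code: the function names, file names, and smell labels are hypothetical, and detections are assumed to be represented as (sample, smell type) pairs.

```python
def prf1(gold, predicted):
    """Return (precision, recall, F1) for two sets of (sample_id, smell_type) pairs."""
    tp = len(gold & predicted)   # smells the model reported that are annotated
    fp = len(predicted - gold)   # smells the model reported that are not annotated
    fn = len(gold - predicted)   # annotated smells the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_smell(gold, predicted):
    """Compute the same metrics separately for every smell type seen in either set."""
    smell_types = {smell for _, smell in gold | predicted}
    return {
        smell: prf1(
            {pair for pair in gold if pair[1] == smell},
            {pair for pair in predicted if pair[1] == smell},
        )
        for smell in smell_types
    }


# Hypothetical annotations and model output, for illustration only.
gold = {("a.py", "long_method"), ("b.java", "god_class"), ("c.js", "long_method")}
predicted = {("a.py", "long_method"), ("b.java", "long_method"), ("c.js", "long_method")}

overall = prf1(gold, predicted)
by_smell = per_smell(gold, predicted)
print(overall)
print(by_smell)
```

Category-level scores would follow the same pattern, assuming each smell type is first mapped to its category before the pairs are compared.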

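This study addresses the problem of identifying the most effective large language model for code smell detection. It proposes a structured methodology and evaluation matrix, and benchmarks two state-of-the-art LLMs on a dataset covering four programming languages. The analysis reveals significant differences between the models in performance and cost-effectiveness, offering valuable guidance for practitioners choosing a solution for automated code smell detection.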

Benchmarking Large Language Models for Code Smell Detection: OpenAI GPT-4.0 vs. DeepSeek-V3