Large Language Models (LLMs) are now deployed across many domains, but
their capacity to solve Capture the Flag (CTF) challenges in
cybersecurity has not been thoroughly evaluated. To address this, we develop a
novel method to assess LLMs in solving CTF challenges by creating a scalable,
open-source benchmark database specifically designed for these applications.
The database compiles a diverse range of CTF challenges from popular
competitions and includes metadata for LLM testing and adaptive learning.
Utilizing the advanced function calling capabilities of LLMs, we build a fully
automated system with an enhanced workflow and support for external tool calls.
Our benchmark dataset and automated framework allow us to evaluate the
performance of five LLMs, encompassing both black-box and open-source models.
This work lays the foundation for future research into improving the efficiency
of LLMs in interactive cybersecurity tasks and automated task planning. By
providing a specialized dataset, our project offers an ideal platform for
developing, testing, and refining LLM-based approaches to vulnerability
detection and resolution. Evaluating LLMs on these challenges and comparing
their results with human performance yields insight into the potential of
AI-driven cybersecurity solutions for real-world threat management. We make our
dataset publicly available at this https URL, along with our playground
automated framework at this https URL.
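
As a rough illustration of the kind of tool-calling loop described above, the following minimal sketch shows one way a challenge record and an automated solve loop might fit together. It is not the framework's actual implementation: the `Challenge` fields, the `ToolCall` type, the `read_file` and `check_flag` tools, and the `scripted_policy` stand-in for an LLM are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Challenge:
    # Hypothetical metadata record; field names are illustrative only.
    name: str
    category: str
    description: str
    flag: str
    files: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToolCall:
    # A structured request from the model: which tool to run, with what arguments.
    name: str
    arguments: Dict[str, str]


def make_tools(challenge: Challenge) -> Dict[str, Callable[..., str]]:
    """Build the (hypothetical) external tools the model may call."""
    def read_file(path: str) -> str:
        return challenge.files.get(path, f"error: no such file: {path}")

    def check_flag(flag: str) -> str:
        return "correct" if flag == challenge.flag else "incorrect"

    return {"read_file": read_file, "check_flag": check_flag}


def solve(challenge: Challenge,
          policy: Callable[[List[str]], Optional[ToolCall]],
          max_rounds: int = 5) -> bool:
    """Run the tool-calling loop until the flag is accepted or rounds run out."""
    tools = make_tools(challenge)
    history: List[str] = [challenge.description]
    for _ in range(max_rounds):
        call = policy(history)
        if call is None:  # the model gave up
            return False
        tool = tools.get(call.name)
        if tool is None:
            result = f"error: unknown tool: {call.name}"
        else:
            result = tool(**call.arguments)
        # Feed the tool output back so the next decision can use it.
        history.append(f"{call.name} -> {result}")
        if call.name == "check_flag" and result == "correct":
            return True
    return False


def scripted_policy(history: List[str]) -> Optional[ToolCall]:
    """Stand-in for an LLM: read one file, then submit its contents as the flag."""
    if len(history) == 1:
        return ToolCall("read_file", {"path": "note.txt"})
    last_result = history[-1].split(" -> ", 1)[1]
    return ToolCall("check_flag", {"flag": last_result})


if __name__ == "__main__":
    demo = Challenge(
        name="demo",
        category="misc",
        description="Read note.txt and submit the flag.",
        flag="flag{demo}",
        files={"note.txt": "flag{demo}"},
    )
    print(solve(demo, scripted_policy))
```

In a real deployment the scripted policy would be replaced by a model that returns structured tool calls, and the tools would wrap actual utilities rather than an in-memory file table. The round limit keeps a single evaluation bounded.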

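We develop a novel method to assess the ability of large language models (LLMs) to solve Capture the Flag challenges in cybersecurity by creating a scalable, open-source benchmark database designed specifically for these applications. Using the advanced function calling capabilities of LLMs, we build a fully automated system with an improved workflow and support for external tool calls. By providing a specialized dataset, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing them with human performance gives insight into the potential of AI-driven cybersecurity solutions for real-world threat management.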