Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. First, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Second, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Third, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To address the high cost of manpower, we design a scalable and semi-automated method to construct the benchmark. For task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide an empirical analysis of the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. Environment, benchmark, and evaluation code for AndroidArena are released at https://github.com/AndroidArenaAgent/AndroidArena.

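Large language models (LLMs) empower intelligent agents to perform complex tasks in domain-specific software such as browsers and games. However, when applied to general-purpose software systems such as operating systems, LLM agents face three main challenges: a vast and dynamic action space, the need for cross-application cooperation, and the search for optimal solutions that satisfy user constraints. This study designs AndroidArena, an environment and benchmark, and constructs the benchmark with a scalable, semi-automated method. The results show that even state-of-the-art LLM agents struggle in cross-application scenarios and in adhering to specific constraints. In addition, through an empirical analysis of the failure of reflection, the proposed exploration strategy improves the success rate by 27%. This work is the first to reveal the fine-grained weaknesses of LLM agents and points to directions for future research. The environment, benchmark, and evaluation code for AndroidArena have been publicly released at the linked repository.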

Vulnerability Analysis of Large Language Model Agents in Complex Android Environments