The rapid development of Large Language Models (LLMs) has led to a surge in
applications that facilitate collaboration among multiple agents, assisting
humans in their daily tasks. However, a significant gap remains in assessing to
what extent LLM-powered applications genuinely enhance user experience and task
execution efficiency. This highlights the need to verify utility of LLM-powered
applications, particularly by ensuring alignment between the application's
functionality and end-user needs. We introduce AgentEval, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the effectiveness and robustness of AgentEval for two
open source datasets including Math Problem solving and ALFWorld House-hold
related tasks. For reproducibility purposes, we make the data, code and all the
logs publicly available at this https URL .

通过提出一套针对特定应用目的的标准，AgentEval 框架可以自动化地简化应用的效用验证过程，从而综合评估和量化该应用程序的效用。

评估和验证 LLM 驱动的应用中的任务效用

Assessing and Verifying Task Utility in LLM-Powered Applications

Presently, with the assistance of advanced LLM application development
frameworks, more and more LLM-powered applications can effortlessly augment the
LLMs' knowledge with external content using the retrieval augmented generation
(RAG) technique. However, these frameworks' designs do not have sufficient
consideration of the risk of external content, thereby allowing attackers to
undermine the applications developed with these frameworks. In this paper, we
reveal a new threat to LLM-powered applications, termed retrieval poisoning,
where attackers can guide the application to yield malicious responses during
the RAG process. Specifically, through the analysis of LLM application
frameworks, attackers can craft documents visually indistinguishable from
benign ones. Despite the documents providing correct information, once they are
used as reference sources for RAG, the application is misled into generating
incorrect responses. Our preliminary experiments indicate that attackers can
mislead LLMs with an 88.33\% success rate, and achieve a 66.67\% success rate
in the real-world application, demonstrating the potential impact of retrieval
poisoning.

LLM 应用开发、检索增强生成、LLM 应用、检索污染以及风险评估是本文的关键词。作者揭示了一种称为检索污染的新威胁，攻击者可以通过欺骗 LLM 应用程序在检索生成过程中生成恶意回应，对应用程序进行破坏。通过分析 LLM 应用程序框架，攻击者可以制作与正常文档在视觉上几乎无法区分的文档，一旦这些文档被用作检索增强生成的参考来源，应用程序就会产生错误的响应。初步实验表明攻击者可以以 88.33% 的成功率误导 LLM，并在现实世界的应用中达到 66.67% 的成功率，展示了检索污染的潜在影响。

LLM 技术应用中的人类不可感知检索污染攻击

Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered  Applications

The rapid development in the field of Large Language Models (LLMs) has led to
a surge in applications that facilitate collaboration among multiple agents to
assist humans in their daily tasks. However, a significant gap remains in
assessing whether LLM-powered applications genuinely enhance user experience
and task execution efficiency. This highlights the pressing need for methods to
verify utility of LLM-powered applications, particularly by ensuring alignment
between the application's functionality and end-user needs. We introduce
AgentEval provides an implementation for the math problems}, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the robustness of quantifier's work.

介绍了一种新的框架 AgentEval，用于验证大型语言模型（LLM）驱动应用程序的实用性，并提供一套与特定应用程序目标相符的评估标准，以全面评估其实用性。