Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in real-world applications. Benchmarks such as Chatbot Arena use human judges to assess model performance, but human evaluation is resource-intensive and time-consuming. Alternative methods that use LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench, reduce this cost but still do not capture the fact that some aspects of a complex instruction are more important to follow than others. To address this gap, we propose a novel evaluation metric, \textsc{TOWER}, that incorporates human-judged importance into the assessment of complex instruction following. We show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future research.

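This study addresses the high time and resource cost of current approaches to evaluating how well large language models (LLMs) follow complex human instructions, and proposes a novel evaluation metric, \textsc{TOWER}. The method incorporates human judgments of importance. The study finds that human annotators agree with tree-based representations of complex instructions nearly as much as they agree with other human annotators, which improves the accuracy and efficiency of evaluation.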

TOWER: Tree-Organized Weighting for Evaluating Complex Instructions