Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
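
The pipeline described above (decode each intermediate layer, score its similarity to the final output, feed the scores to a learned classifier, then decide whether to call the tool) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration, not the released implementation: the toy hidden states and unembedding matrix, the token-match similarity (a stand-in for whatever similarity score the method actually uses), the hand-written logistic regression (a stand-in for the learned probabilistic classifier), and the `reward`/`penalty` decision rule (one simple way to trade off risk and reward, not necessarily the definition of expected tool-calling utility). The toy decoder also omits the final normalization layer that a real logit-lens projection would typically apply.

```python
import math

def logit_lens_decode(hidden_states, unembed):
    """Greedy-decode one token id per position from a layer's hidden states.

    Toy logit lens: logits[v] = dot(h, unembed[v]); no final layer norm (assumption).
    """
    tokens = []
    for h in hidden_states:
        logits = [sum(a * b for a, b in zip(h, w)) for w in unembed]
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

def token_match(a, b):
    """Stand-in similarity: fraction of positions where token ids agree."""
    return sum(x == y for x, y in zip(a, b)) / len(b)

def mice_features(layers, unembed):
    """One similarity score per intermediate layer, against the final layer's output."""
    final = logit_lens_decode(layers[-1], unembed)
    return [token_match(logit_lens_decode(layer, unembed), final)
            for layer in layers[:-1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient-descent logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def confidence(x, w, b):
    """Predicted probability that the decoded tool call is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def should_call_tool(p, reward=1.0, penalty=1.0):
    """Call only if expected utility beats abstaining (assumed utility 0)."""
    return p * reward - (1.0 - p) * penalty > 0.0

# Toy example: 3-token vocabulary, 2 positions, 2 intermediate layers + final layer.
unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
layers = [
    [[1, 0, 0], [0, 1, 0]],  # decodes to [0, 1]
    [[0, 0, 1], [0, 1, 0]],  # decodes to [2, 1]
    [[0, 0, 1], [0, 1, 0]],  # final layer, decodes to [2, 1]
]
feats = mice_features(layers, unembed)  # [0.5, 1.0]

# Hypothetical training set: per-layer similarity features and correctness labels.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
p = confidence(feats, w, b)
print(feats, round(p, 3), should_call_tool(p, penalty=4.0))
```

Raising `penalty` relative to `reward` models a higher-risk setting: the same confidence that justifies a call when mistakes are cheap may no longer justify it when mistakes are costly.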

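This work addresses confidence calibration for tool-using agents as they carry out tasks. It proposes a novel model-internal confidence estimator (MICE), which assesses confidence by decoding from the intermediate layers of the language model. MICE matches or beats existing baselines on calibration and significantly improves tool-calling utility. It is also sample-efficient, generalizes zero-shot to unseen APIs, and achieves higher tool-calling utility in scenarios with varying risk levels.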

MICE: Model-Internal Confidence Estimation for Calibrating Tool-Using Agents