An important yet unexplored aspect of previous work on user satisfaction
estimation for Task-Oriented Dialogue (TOD) systems is how robustly the
estimators identify user dissatisfaction: current
benchmarks for user satisfaction estimation in TOD systems are highly skewed
towards dialogues for which the user is satisfied. The effect of having a more
balanced set of satisfaction labels on performance is unknown. However,
balancing the data with more dissatisfactory dialogue samples requires further
data collection and human annotation, which is costly and time-consuming. In
this work, we leverage large language models (LLMs) and unlock their ability to
generate satisfaction-aware counterfactual dialogues to augment the set of
original dialogues of a test collection. We gather human annotations to ensure
the reliability of the generated samples. We evaluate two open-source LLMs as
user satisfaction estimators on our augmented collection against
state-of-the-art fine-tuned models. Our experiments show that when used as
few-shot user satisfaction estimators, open-source LLMs are more robust
to an increase in the number of dissatisfaction labels in the test collection
than the fine-tuned state-of-the-art models. Our results shed light on the need
for data augmentation approaches for user satisfaction estimation in TOD
systems. We release our aligned counterfactual dialogues, which are curated by
human annotation, to facilitate further research on this topic.

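Large language models (LLMs) are used to generate satisfaction-aware counterfactual dialogues that augment the original dialogues of a test collection, and the generated samples are validated by human annotation. The study shows that, compared with state-of-the-art fine-tuned models, open-source LLMs used as few-shot user satisfaction estimators are more robust to an increase in the number of dissatisfaction labels in the test collection.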

Causal Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

The tasks of out-of-domain (OOD) intent discovery and generalized intent
discovery (GID) aim to extend a closed intent classifier to open-world intent
sets, which is crucial to task-oriented dialogue (TOD) systems. Previous
methods address them by fine-tuning discriminative models. Recently, although
some studies have explored the application of large language models
(LLMs) represented by ChatGPT to various downstream tasks, it remains unclear
how well ChatGPT can discover and incrementally extend OOD intents. In
this paper, we comprehensively evaluate ChatGPT on OOD intent discovery and
GID, and then outline the strengths and weaknesses of ChatGPT. Overall, ChatGPT
exhibits consistent advantages under zero-shot settings, but is still at a
disadvantage compared to fine-tuned models. Through a series of further
analytical experiments, we summarize and discuss the challenges faced by LLMs
including clustering, domain-specific understanding, and cross-domain
in-context learning scenarios. Finally, we provide empirical guidance for
future directions to address these challenges.

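This work comprehensively evaluates ChatGPT on OOD intent discovery and generalized intent discovery and outlines its strengths and weaknesses. ChatGPT shows consistent advantages in zero-shot settings but is still at a disadvantage compared to fine-tuned models. Through a series of analytical experiments, we summarize and discuss the challenges faced by LLMs, including clustering, domain-specific understanding, and cross-domain in-context learning scenarios. Finally, we provide empirical guidance for future directions to address these challenges.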