Speech encompasses a wealth of information, including but not limited to
content, paralinguistic, and environmental information. This comprehensive
nature of speech significantly impacts communication and is crucial for
human-computer interaction. Chat-Oriented Large Language Models (LLMs), known
for their general-purpose assistance capabilities, have evolved to handle
multi-modal inputs, including speech. Although these models can be adept at
recognizing and analyzing speech, they often fall short of generating
appropriate responses. We argue that this stems from a lack of principles for
task definition and model development, which requires open-source datasets and
metrics suitable for model evaluation. To bridge this gap, we present SD-Eval, a
benchmark dataset aimed at multidimensional evaluation of spoken dialogue
understanding and generation. SD-Eval focuses on paralinguistic and
environmental information and includes 7,303 utterances, amounting to 8.76
hours of speech data. The data is aggregated from eight public datasets,
representing four perspectives: emotion, accent, age, and background sound. To
assess the SD-Eval benchmark dataset, we implement three different models and
construct a training set following a process similar to that of SD-Eval. The
training set contains 1,052.72 hours of speech data and 724.4k utterances. We
also conduct a comprehensive evaluation of the generated responses using
objective methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based
metrics. Models conditioned with paralinguistic and environmental information
outperform their counterparts in both objective and subjective measures.
Moreover, experiments demonstrate that LLM-based metrics show a higher
correlation with human evaluation than traditional metrics. We open-source SD-Eval
at this https URL
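
As a rough illustration of the kind of objective metric named above, the sketch
below computes a ROUGE-L style F-score from the longest common subsequence of
whitespace-separated tokens. It is a simplified stand-in, not the scoring code
used for SD-Eval: the function names, the lowercasing, and the whitespace
tokenization are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 between a reference and a generated response.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A metric like this only measures word overlap with a reference response, which
is one plausible reason such scores could track human judgments less closely
than LLM-based metrics do.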

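To evaluate and improve the capabilities of large language models in spoken dialogue understanding and generation, we present the SD-Eval benchmark dataset, which aggregates 7,303 utterances representing four dimensions (emotion, accent, age, and background sound), totaling 8.76 hours of speech data. Using objective and subjective evaluation methods, as well as LLM-based metrics, we show that using the additional information carried by speech in task definition and model development can significantly improve the quality of generated responses.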

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

We investigate the potential of ChatGPT as a multidimensional evaluator for
the task of Text Style Transfer, alongside, and in comparison to,
existing automatic metrics as well as human judgments. We focus on a zero-shot
setting, i.e., prompting ChatGPT with specific task instructions, and test its
performance on three commonly-used dimensions of text style transfer
evaluation: style strength, content preservation, and fluency. We perform a
comprehensive correlation analysis for two transfer directions (and overall) at
different levels. Compared to existing automatic metrics, ChatGPT achieves
competitive correlations with human judgments. These preliminary results are
expected to provide a first glimpse into the role of large language models in
the multidimensional evaluation of stylized text generation.
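
The correlation analysis described above can be pictured as comparing a list of
evaluator scores against a list of human ratings for the same outputs. The
abstract does not say which correlation coefficient is used, so the sketch below
assumes Spearman rank correlation, a common choice for this kind of
meta-evaluation. The function names and the example scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Hypothetical scores for five transferred sentences (not data from the paper).
evaluator_scores = [4.0, 2.5, 3.0, 5.0, 1.0]
human_ratings = [4, 2, 3, 5, 2]
print(spearman(evaluator_scores, human_ratings))
```

Running the same comparison separately for style strength, content
preservation, and fluency would give one coefficient per dimension, which is
one way to compare an LLM evaluator with an automatic metric.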

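This paper evaluates the role of ChatGPT in the multidimensional evaluation of text style transfer, comparing it with existing automatic metrics and with human judgments. The results show that, at different levels, ChatGPT achieves correlations with human judgments similar to those of existing automatic metrics.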