Expressive speech-to-speech translation (S2ST) aims to transfer prosodic
attributes of source speech to target speech while maintaining translation
accuracy. Existing research in expressive S2ST is limited, typically focusing
on a single expressivity aspect at a time. Likewise, this research area lacks
standard evaluation protocols and well-curated benchmark datasets. In this
work, we propose a holistic cascade system for expressive S2ST, combining
multiple prosody transfer techniques previously considered only in isolation.
We curate a benchmark expressivity test set in the TV series domain and
explore a second dataset in the audiobook domain. Finally, we present a human
evaluation protocol to assess multiple expressive dimensions across speech
pairs. Experimental results indicate that bilingual annotators can assess the
quality of expressivity preservation in S2ST systems, and that the holistic
modeling approach outperforms single-aspect systems. Audio samples can be accessed
through our demo webpage:
this https URL.

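This paper proposes a holistic cascade system that combines multiple prosody transfer techniques to carry the expressiveness of source-language speech over into the target language. We also build a benchmark expressivity test set for evaluating multiple expressive dimensions. Experimental results show that this holistic modeling approach outperforms single-aspect approaches.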

A comprehensive cascade system, benchmark, and human evaluation protocol for expressive speech translation

A Holistic Cascade System, Benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation

We establish THumB, a rubric-based human evaluation protocol for image
captioning models. Our scoring rubrics and their definitions are carefully
developed based on machine- and human-generated captions on the MSCOCO dataset.
Each caption is evaluated along two main dimensions in a tradeoff (precision
and recall) as well as other aspects that measure the text quality (fluency,
conciseness, and inclusive language). Our evaluations demonstrate several
critical problems of the current evaluation practice. Human-generated captions
show substantially higher quality than machine-generated ones, especially in
coverage of salient information (i.e., recall), while most automatic metrics
say the opposite. Our rubric-based results reveal that CLIPScore, a recent
metric that uses image features, better correlates with human judgments than
conventional text-only metrics because it is more sensitive to recall. We hope
that this work will promote a more transparent evaluation protocol for image
captioning and its automatic metrics.
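
The claim that one metric "better correlates with human judgments" than another can be made concrete with a small calculation. The sketch below is illustrative only: it assumes Pearson correlation as the statistic (the abstract does not say which one is used), and the scores, along with the names `human_scores`, `metric_a`, and `metric_b`, are hypothetical values invented for the example, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical numbers, invented purely for illustration (not from the paper).
human_scores = [4.5, 3.0, 2.0, 5.0, 3.5]    # one human rating per caption
metric_a = [0.81, 0.62, 0.40, 0.90, 0.70]   # rises and falls with the ratings
metric_b = [0.50, 0.55, 0.52, 0.49, 0.51]   # orders the captions differently

print("metric_a vs. human:", pearson(human_scores, metric_a))
print("metric_b vs. human:", pearson(human_scores, metric_b))
```

Under these made-up numbers, `metric_a` would be the one that agrees more closely with the human ratings. A rank-based statistic such as Kendall's tau or Spearman's rho could be substituted if only the ordering of captions matters.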

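This paper introduces THumB, an evaluation protocol for image captioning models that is developed from machine- and human-generated captions on the MSCOCO dataset and is used to assess caption quality. Our experiments find that CLIPScore, a recent metric that uses image features, agrees more closely with human judgments than conventional text-only metrics.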

Transparent human evaluation of image captions

Transparent Human Evaluation for Image Captioning

Sentences produced by abstractive summarization systems can be ungrammatical
and fail to preserve the original meaning, despite being locally fluent. In
this paper, we propose to remedy this problem by jointly generating a sentence
and its syntactic dependency parse while performing abstraction. If generating
a word can introduce an erroneous relation to the summary, the behavior must be
discouraged. The proposed method thus holds promise for producing grammatical
sentences and encouraging the summary to stay true to the original. The
contributions of this work are twofold. First, we present a novel neural
architecture for abstractive summarization that combines a sequential decoder
with a tree-based decoder in a synchronized manner to generate a summary
sentence and its syntactic parse. Second, we describe a novel human
evaluation protocol to assess whether, and to what extent, a summary remains
true to its original meaning. We evaluate our method on a number of summarization
datasets and demonstrate competitive results against strong baselines.

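This paper proposes a novel neural architecture for abstractive summarization that generates a summary together with its syntactic parse, and describes a novel human evaluation protocol for assessing whether a summary stays true to the original meaning. Evaluation shows that the method achieves competitive results against strong baselines on several summarization datasets.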