Authorship verification is the problem of determining if two distinct writing
samples share the same author and is typically concerned with the attribution
of written text. In this paper, we explore the attribution of transcribed
speech, which poses novel challenges. The main challenge is that many stylistic
features, such as punctuation and capitalization, are not available or
reliable. Therefore, we expect a priori that transcribed speech is a more
challenging domain for attribution. On the other hand, other stylistic
features, such as speech disfluencies, may enable more successful attribution
but, being specific to speech, require special-purpose models. To better
understand the challenges of this setting, we contribute the first systematic
study of speaker attribution based solely on transcribed speech. Specifically,
we propose a new benchmark for speaker attribution focused on conversational
speech transcripts. To control for spurious associations between speakers and
topics, we use both conversation prompts and speakers participating in the
same conversation to construct challenging verification trials of varying
difficulty. We establish the state of the art on this new benchmark by
comparing a suite of neural and non-neural baselines, finding that although
written text attribution models achieve surprisingly good performance in
certain settings, they struggle in the hardest settings we consider.
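
The trial construction described above can be illustrated with a minimal sketch. It is not the benchmark's actual procedure: the `Sample` fields, the `build_trials` function, and the three difficulty names are hypothetical, chosen only to show how prompts and shared conversations could restrict negative pairs so that topic alone cannot separate speakers.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical record layout; the real benchmark format is not given in the text.
@dataclass(frozen=True)
class Sample:
    speaker: str
    conversation: str
    prompt: str
    text: str

# Assumed difficulty levels, from easiest to hardest.
DIFFICULTIES = ("base", "same_prompt", "same_conversation")

def build_trials(samples, difficulty="base"):
    """Return a list of (sample_a, sample_b, same_speaker) verification trials."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty}")
    trials = []
    for a, b in combinations(samples, 2):
        if a.speaker == b.speaker:
            # Positive trial: same speaker, drawn from different conversations.
            if a.conversation != b.conversation:
                trials.append((a, b, True))
        elif difficulty == "base":
            # Any two different speakers form a negative trial.
            trials.append((a, b, False))
        elif difficulty == "same_prompt" and a.prompt == b.prompt:
            # Negative pair discusses the same prompt, so topic is controlled.
            trials.append((a, b, False))
        elif difficulty == "same_conversation" and a.conversation == b.conversation:
            # Negative pair took part in the same conversation.
            trials.append((a, b, False))
    return trials
```

Under these assumptions, moving from `"base"` to `"same_conversation"` shrinks the negative set to pairs that share topical content, which is the property the harder settings are meant to test.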

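This paper studies authorship identification on transcribed speech, focusing on the challenges specific to transcripts, including controlling for topic correlation and building a speaker attribution benchmark from transcribed speech. Comparing neural and non-neural models, it finds that although written-text attribution models perform surprisingly well in some settings, they still struggle in the hardest settings considered.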

Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

A widespread approach to processing spoken language is to first automatically
transcribe it into text. An alternative is to use an end-to-end approach:
recent works have proposed to learn semantic embeddings of spoken language from
images with spoken captions, without an intermediate transcription step. We
propose to use multitask learning to exploit existing transcribed speech within
the end-to-end setting. We describe a three-task architecture which combines
the objectives of matching spoken captions with corresponding images, speech
with text, and text with images. We show that the addition of the speech/text
task leads to substantial performance improvements on image retrieval when
compared to training the speech/image task in isolation. We conjecture that
this is due to a strong inductive bias transcribed speech provides to the
model, and offer supporting evidence for this.
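
As a rough illustration of how three matching objectives might be combined, the sketch below sums one loss per task over a batch of already-encoded items. The abstract does not state the loss function, task weighting, or training schedule, so the margin-based ranking loss, the `weights` parameter, and all function names here are assumptions rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def margin_ranking_loss(a, b, margin=0.2):
    """Hinge loss over a batch where a[i] is the true match of b[i].

    Assumed loss: each matched pair should be more similar, by at least
    `margin`, than any mismatched pair that shares one of its members.
    """
    n = len(a)
    total = 0.0
    for i in range(n):
        pos = cosine(a[i], b[i])
        for j in range(n):
            if j == i:
                continue
            total += max(0.0, margin + cosine(a[i], b[j]) - pos)  # a[i] vs wrong b[j]
            total += max(0.0, margin + cosine(a[j], b[i]) - pos)  # b[i] vs wrong a[j]
    return total / n

def multitask_loss(speech, image, text, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the speech/image, speech/text and text/image losses."""
    w_si, w_st, w_ti = weights  # hypothetical task weights
    return (w_si * margin_ranking_loss(speech, image)
            + w_st * margin_ranking_loss(speech, text)
            + w_ti * margin_ranking_loss(text, image))
```

In this sketch the speech embeddings receive gradient signal from both the image and the text tasks, which is one way transcribed speech could shape the speech encoder; alternating between tasks per batch would be an equally plausible design.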

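This paper describes a multitask learning approach that exploits existing transcribed speech within end-to-end spoken language processing, yielding a substantial improvement in image retrieval performance. The improvement is attributed to the strong inductive bias that transcribed speech provides to the model, and is achieved through three tasks: matching spoken captions with images, speech with text, and text with images.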