We propose a new two-stage pre-training framework for video-to-text
generation tasks such as video captioning and video question answering: A
generative encoder-decoder model is first jointly pre-trained on massive
image-text data to learn fundamental vision-language concepts, and then adapted
to video data in an intermediate video-text pre-training stage to learn
video-specific skills such as spatio-temporal reasoning. As a result, our
VideoOFA model achieves new state-of-the-art performance on four Video
Captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr
score. It also outperforms existing models on two open-ended Video Question
Answering datasets, showcasing its generalization capability as a universal
video-to-text model.

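This work proposes VideoOFA, a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering. The model is first pre-trained on large-scale image-text data, then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. It achieves new state-of-the-art results on four video captioning benchmarks and outperforms existing models on two open-ended video question answering datasets, demonstrating its generalization capability as a universal video-to-text model.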

VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR)
system to extract sign language glosses from videos. Then, a translation system
generates spoken language translations from the sign language glosses. This
paper focuses on the translation system and introduces the STMC-Transformer
which improves on the current state-of-the-art by over 5 and 7 BLEU
respectively on gloss-to-text and video-to-text translation of the
PHOENIX-Weather 2014T dataset. On the ASLG-PC12 corpus, we report an increase
of over 16 BLEU.
We also demonstrate the problem in current methods that rely on gloss
supervision. The video-to-text translation of our STMC-Transformer outperforms
translation of GT glosses. This contradicts previous claims that GT gloss
translation acts as an upper bound for SLT performance and reveals that glosses
are an inefficient representation of sign language. For future SLT research, we
therefore suggest an end-to-end training of the recognition and translation
models, or using a different sign language annotation scheme.

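This work introduces the STMC-Transformer translation system, which improves on the current state of the art by over 5 and 7 BLEU, respectively, on gloss-to-text and video-to-text translation of the PHOENIX-Weather 2014T dataset, and by over 16 BLEU on the ASLG-PC12 corpus. It also exposes a problem in current methods that rely on gloss supervision: video-to-text translation outperforms translation of ground-truth glosses, which indicates that glosses are an inefficient representation of sign language. The authors therefore suggest that future SLT research train the recognition and translation models end to end, or use a different sign language annotation scheme.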

Better Sign Language Translation with STMC-Transformer

Automatic transcriptions of consumer-generated multi-media content such as
"YouTube" videos still exhibit high word error rates. Such data typically
occupies a very broad domain, has been recorded in challenging conditions, with
cheap hardware and a focus on the visual modality, and may have been
post-processed or edited. In this paper, we extend our earlier work on adapting
the acoustic model of a DNN-based speech recognition system to an RNN language
model and show how both can be adapted to the objects and scenes that can be
automatically detected in the video. We are working on a corpus of "how-to"
videos from the web, and the idea is that an object that can be seen ("car"),
or a scene that is being detected ("kitchen") can be used to condition both
models on the "context" of the recording, thereby reducing perplexity and
improving transcription. We achieve good improvements in both cases and compare
and analyze the respective reductions in word error rate. We expect that our
results can be used for any type of speech processing in which "context"
information is available, for example in robotics, man-machine interaction, or
when indexing large audio-visual archives, and should ultimately help to bring
together the "video-to-text" and "speech-to-text" communities.

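This paper extends earlier work on adapting the acoustic model of a DNN-based speech recognition system to an RNN language model, with the aim of improving automatic transcription of web videos. Conditioning both models on the objects and scenes automatically detected in the video reduces perplexity and improves transcription. The results are expected to apply to robotics, man-machine interaction, and the indexing of large audio-visual archives.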
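
The idea of conditioning a language model on detected context to reduce perplexity can be illustrated with a toy sketch. This is not the paper's method: it swaps the RNN language model for an add-one-smoothed unigram model, and the corpus, the context labels, and the names `unigram_model` and `perplexity` are all hypothetical, chosen only to show the effect on a few sentences.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (detected context label, transcript).
TRAIN = [
    ("kitchen", "chop the onion"),
    ("kitchen", "heat the pan"),
    ("car", "change the tire"),
    ("car", "check the engine"),
]

# Shared vocabulary so every model is smoothed over the same word set.
VOCAB = sorted({w for _, s in TRAIN for w in s.split()})


def unigram_model(sentences):
    """Return an add-one-smoothed unigram probability function."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + len(VOCAB))


def perplexity(prob, sentence):
    """Per-word perplexity of a sentence under a unigram model."""
    words = sentence.split()
    log_prob = sum(math.log(prob(w)) for w in words)
    return math.exp(-log_prob / len(words))


# One model that ignores context, and one model per context label.
global_lm = unigram_model([s for _, s in TRAIN])
context_lms = {
    label: unigram_model([s for c, s in TRAIN if c == label])
    for label in {c for c, _ in TRAIN}
}

# A held-out utterance from a video in which a kitchen scene was detected.
test_sentence = "heat the onion"
ppl_global = perplexity(global_lm, test_sentence)
ppl_context = perplexity(context_lms["kitchen"], test_sentence)

print(f"unconditioned perplexity: {ppl_global:.2f}")
print(f"kitchen-conditioned perplexity: {ppl_context:.2f}")
```

On this toy data the context-conditioned model assigns the held-out sentence a lower perplexity than the unconditioned one, because probability mass is no longer spent on words from the other context. A neural model would typically receive the context as an extra input vector rather than through separate per-context models.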