Unsupervised learning objectives like language modeling and de-noising
play a significant part in producing pre-trained models that serve
various downstream applications, from natural language understanding to
conversational tasks. However, despite the impressive conversational capabilities
of recent large language models, their ability to capture syntactic or
semantic structure within text lags behind. We hypothesize that the mismatch
between linguistic performance and competence in machines is attributable to
insufficient transfer of linguistic structure knowledge to computational
systems with currently popular pre-training objectives. We show that
punctuation restoration transfers to improvements in in- and
out-of-distribution performance on structure-related tasks like named entity
recognition, open information extraction, chunking, and part-of-speech tagging.
Punctuation restoration is an effective learning objective that can improve
structure understanding and yield more robust structure-aware representations
of natural language.
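
The abstract does not spell out how punctuation restoration is set up as a training objective. As a rough sketch, assuming it is cast as a text-to-text task, training pairs can be built from raw text alone by stripping punctuation from the input and keeping the original as the target. The function name `make_restoration_pair` and the set of stripped marks are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple

# Assumption: a small illustrative set of marks to strip; the paper's
# actual set of punctuation marks is not given in the abstract.
STRIPPED_MARKS = set(",.;:!?\"")


def make_restoration_pair(text: str) -> Tuple[str, str]:
    """Build one (source, target) pair for punctuation restoration.

    source: the text with the chosen punctuation marks removed
    target: the original, fully punctuated text
    """
    stripped = "".join(ch for ch in text if ch not in STRIPPED_MARKS)
    # Collapse any doubled whitespace left behind by removed marks.
    source = " ".join(stripped.split())
    return source, text


if __name__ == "__main__":
    src, tgt = make_restoration_pair("Hello, world! How are you?")
    print(src)  # punctuation-free model input
    print(tgt)  # punctuated text the model learns to reproduce
```

Because the target is just the original text, no human annotation is needed, which is what makes the objective unsupervised.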

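Unsupervised learning objectives such as language modeling and de-noising play an important role in producing pre-trained models. However, although the conversational capabilities of recent large language models are impressive, they lag behind in capturing syntactic or semantic structure within text. We hypothesize that this gap between linguistic performance and competence in machines is caused by insufficient transfer of linguistic structure knowledge under currently popular pre-training objectives. We show that punctuation restoration improves in- and out-of-distribution performance on structure-related tasks such as named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration is an effective learning objective that can improve structure understanding and yield more robust structure-aware representations of natural language.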

Punctuation Restoration Improves Structure Understanding without Supervision

This paper presents a unified Vision-Language Pre-training (VLP) model. The
model is unified in that (1) it can be fine-tuned for either vision-language
generation (e.g., image captioning) or understanding (e.g., visual question
answering) tasks, and (2) it uses a shared multi-layer transformer network for
both encoding and decoding, which differs from many existing methods where the
encoder and decoder are implemented using separate models. The unified VLP
model is pre-trained on a large number of image-text pairs using the
unsupervised learning objectives of two tasks: bidirectional and
sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks
differ solely in what context the prediction conditions on. This is controlled
by utilizing specific self-attention masks for the shared transformer network.
To the best of our knowledge, VLP is the first reported model that achieves
state-of-the-art results on both vision-language generation and understanding
tasks, as disparate as image captioning and visual question answering, across
three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and
VQA 2.0. The code and the pre-trained models are available at
this https URL
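
The abstract says only that the bidirectional and seq2seq objectives differ in the context each prediction conditions on, and that self-attention masks control this. The sketch below illustrates one common convention, assumed here rather than taken from the paper: the image context comes first in the sequence, and in seq2seq mode text positions see the full context plus text up to and including themselves. The function names and the layout of `n_ctx` context positions followed by `n_txt` text positions are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # mask[i][j] == 1 means position i may attend to j


def bidirectional_mask(n_ctx: int, n_txt: int) -> Mask:
    """Every position may attend to every other position."""
    n = n_ctx + n_txt
    return [[1] * n for _ in range(n)]


def seq2seq_mask(n_ctx: int, n_txt: int) -> Mask:
    """Context attends to context only; text attends to all context
    and to text positions up to and including itself."""
    n = n_ctx + n_txt
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_ctx:
                mask[i][j] = 1  # context columns are visible to all rows
            elif i >= n_ctx and j <= i:
                mask[i][j] = 1  # text sees earlier text and itself
    return mask


if __name__ == "__main__":
    for row in seq2seq_mask(2, 3):
        print(row)
```

Under this assumption the same transformer weights serve both objectives, and only the mask passed to self-attention changes.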

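This paper proposes a unified vision-language pre-training model that uses a shared multi-layer Transformer network for both encoding and decoding. The model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks, and it achieves state-of-the-art results on multiple tasks such as image captioning and visual question answering.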