The pre-training of masked language models (MLMs) consumes massive
computation to achieve good results on downstream NLP tasks, resulting in a
large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as
placeholders and gather the contextualized information from unmasked tokens to
restore the corrupted information. This raises the question of whether we can
append [MASK]s at a later layer, to reduce the sequence length for earlier
layers and make the pre-training more efficient. We show: (1) [MASK]s can
indeed be appended at a later layer, being disentangled from the word
embedding; (2) the gathering of contextualized information from unmasked tokens
can be conducted with a few layers. By further increasing the masking rate from
15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with
only 78% and 68% of the original computational budget without any degradation
on the GLUE benchmark. When pre-training with the original budget, our method
outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
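
To make the sequence-length argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's cost model. It assumes that layer cost scales linearly with the number of tokens processed (ignoring the quadratic attention term, embeddings, and the output head), and the function name and parameters are hypothetical. It does not reproduce the 78% and 68% figures reported above.

```python
def relative_token_cost(num_layers, late_layers, mask_rate):
    """Token-layer computations relative to a vanilla MLM (illustrative only).

    Assumption: the first (num_layers - late_layers) layers see only the
    unmasked tokens, and the last `late_layers` layers see the full sequence
    after the [MASK]s are appended. Cost is taken to be linear in tokens.
    """
    if not 0 <= late_layers <= num_layers:
        raise ValueError("late_layers must be between 0 and num_layers")
    if not 0.0 <= mask_rate < 1.0:
        raise ValueError("mask_rate must be in [0, 1)")
    early_layers = num_layers - late_layers
    # Early layers process a (1 - mask_rate) fraction of the sequence;
    # late layers process the whole sequence.
    reduced = early_layers * (1.0 - mask_rate) + late_layers * 1.0
    return reduced / num_layers


# Hypothetical setting: 12 layers, [MASK]s appended before the last 2 layers.
for rate in (0.15, 0.5):
    print(rate, round(relative_token_cost(12, 2, rate), 3))
```

Under these assumptions, a higher masking rate shrinks the early-layer sequence further, which is the intuition behind masking more while appending the [MASK]s later.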

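Appending [MASK] tokens at a later layer during pre-training shortens the sequence processed by the earlier layers, which makes RoBERTa pre-training more efficient under a reduced computational budget while also achieving better results on the GLUE benchmark.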

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

As an important step towards visual reasoning, visual grounding (e.g., phrase
localization, referring expression comprehension/segmentation) has been widely
explored. Previous approaches to referring expression comprehension (REC) or
segmentation (RES) either suffer from limited performance, due to a two-stage
setup, or require the design of complex task-specific one-stage
architectures. In this paper, we propose a simple one-stage multi-task
framework for visual grounding tasks. Specifically, we leverage a transformer
architecture, where two modalities are fused in a visual-lingual encoder. In
the decoder, the model learns to generate contextualized lingual queries which
are then decoded and used to directly regress the bounding box and produce a
segmentation mask for the corresponding referred regions. With this simple but
highly contextualized model, we outperform state-of-the-art methods by a large
margin on both REC and RES tasks. We also show that a simple pre-training
schedule (on an external dataset) further improves the performance. Extensive
experiments and ablations illustrate that our model benefits greatly from
contextualized information and multi-task training.

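This study proposes a one-stage multi-task model based on the transformer architecture that fuses visual and language inputs for visual grounding. By exploiting contextualized information and multi-task training, the model outperforms existing methods by a clear margin on referring expression comprehension and segmentation.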

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Matching natural language sentences is central for many applications such as
information retrieval and question answering. Existing deep models rely on a
single sentence representation or multiple granularity representations for
matching. However, such methods cannot adequately capture the contextualized local
information in the matching process. To tackle this problem, we present a new
deep architecture to match two sentences with multiple positional sentence
representations. Specifically, each positional sentence representation is a
sentence representation at this position, generated by a bidirectional long
short-term memory (Bi-LSTM). The matching score is finally produced by
aggregating interactions between these different positional sentence
representations, through $k$-Max pooling and a multi-layer perceptron. Our
model has several advantages: (1) By using Bi-LSTM, rich context of the whole
sentence is leveraged to capture the contextualized local information in each
positional sentence representation; (2) By matching with multiple positional
sentence representations, the model can flexibly aggregate different important
contextualized local information in a sentence to support the matching; (3)
Experiments on different tasks such as question answering and sentence
completion demonstrate the superiority of our model.
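
The aggregation step can be illustrated with a minimal sketch of $k$-max pooling over pairwise interactions. It assumes the positional sentence representations have already been produced (the Bi-LSTM is not implemented here) and uses cosine similarity as the interaction function, which is an assumption since the abstract does not name one. The function names are hypothetical, and the final multi-layer perceptron is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_max_interactions(reps_x, reps_y, k):
    """Return the k largest interaction scores between two sentences.

    reps_x, reps_y: lists of positional sentence representations, one vector
    per position (assumed to come from a Bi-LSTM in the described model).
    """
    scores = [cosine(u, v) for u in reps_x for v in reps_y]
    # k-max pooling: keep only the k strongest interactions, in descending order.
    return sorted(scores, reverse=True)[:k]


# Toy 2-dimensional representations for two short sentences.
reps_x = [[1.0, 0.0], [0.0, 1.0]]
reps_y = [[1.0, 0.0], [1.0, 1.0]]
top = k_max_interactions(reps_x, reps_y, k=2)
print(top)
```

In the described model, the pooled scores would then be passed to a multi-layer perceptron to produce the final matching score.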

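This paper introduces a new deep architecture that matches two sentences using multiple positional sentence representations generated by a bidirectional long short-term memory network (Bi-LSTM). Experiments show that the model draws on the rich context of the whole sentence and can flexibly capture different important local information in a sentence to support matching.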