The mechanisms behind the success of multi-view self-supervised learning
(MVSSL) are not yet fully understood. Contrastive MVSSL methods have been
studied through the lens of InfoNCE, a lower bound on the mutual information
(MI). However, the relation between other MVSSL methods and MI remains unclear.
We consider a different lower bound on the MI consisting of an entropy and a
reconstruction term (ER), and analyze the main MVSSL families through its lens.
Through this ER bound, we show that clustering-based methods such as
DeepCluster and SwAV maximize the MI. We also re-interpret the mechanisms of
distillation-based approaches such as BYOL and DINO, showing that they
explicitly maximize the reconstruction term and implicitly encourage a stable
entropy, and we confirm this empirically. We show that replacing the objectives
of common MVSSL methods with this ER bound achieves competitive performance,
while making them stable when training with smaller batch sizes or smaller
exponential moving average (EMA) coefficients.
GitHub repo: this https URL
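
As a rough illustration of what a bound "consisting of an entropy and a reconstruction term" can look like, the standard variational argument is sketched below. The notation is an assumption made here, not taken from the abstract: $Z_1$ and $Z_2$ denote the representations of two views, and $q(z_2 \mid z_1)$ is any auxiliary (reconstruction) distribution. The exact form used in the paper may differ.

```latex
\begin{align}
I(Z_1; Z_2) &= H(Z_2) - H(Z_2 \mid Z_1) \\
            &= H(Z_2) + \mathbb{E}_{p(z_1, z_2)}\big[\log p(z_2 \mid z_1)\big] \\
            &\geq H(Z_2) + \mathbb{E}_{p(z_1, z_2)}\big[\log q(z_2 \mid z_1)\big]
\end{align}
```

The inequality holds because $\mathbb{E}_{p(z_1)}\big[\mathrm{KL}\big(p(z_2 \mid z_1) \,\|\, q(z_2 \mid z_1)\big)\big] \geq 0$. The first term is the entropy and the second is the reconstruction term, so a method that raises the reconstruction term while keeping the entropy from collapsing raises this lower bound on the MI.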

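The mechanisms behind the success of multi-view self-supervised learning are not yet fully understood. This paper analyzes them through a lower bound on the mutual information (MI) made up of an entropy term and a reconstruction term (ER). It finds that clustering-based methods maximize the MI, whereas distillation-based methods explicitly maximize the reconstruction term and implicitly encourage a stable entropy. Replacing the objectives of common MVSSL methods with the ER bound achieves competitive performance and keeps training stable with small batch sizes or small exponential moving average (EMA) coefficients.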
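
For readers unfamiliar with the EMA coefficient mentioned above, the following is a minimal sketch of the teacher update commonly used in distillation-style methods such as BYOL and DINO. The function name `ema_update`, the symbol `tau`, and the flat parameter lists are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
def ema_update(teacher, student, tau):
    """Return teacher parameters after one EMA step.

    Each new teacher value is tau * teacher + (1 - tau) * student.
    `tau` is the EMA coefficient (hypothetical name); values close to 1
    make the teacher change slowly, smaller values make it track the
    student more closely.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy parameters (assumed values, for illustration only).
teacher = [0.0, 0.0]
student = [1.0, -2.0]

# With a fixed student, the teacher moves toward it geometrically.
for _ in range(3):
    teacher = ema_update(teacher, student, tau=0.5)
```

A smaller `tau` means the teacher follows the student more quickly, which is the regime the abstract reports as more stable under the ER objective.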

The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning

Pretrained language models (PTLMs) are typically learned over a large, static
corpus and further fine-tuned for various downstream tasks. However, when
deployed in the real world, a PTLM-based model must deal with data
distributions that deviate from what the PTLM was initially trained on. In this
paper, we study a lifelong language model pretraining challenge where a PTLM is
continually updated so as to adapt to emerging data. Over a domain-incremental
research paper stream and a chronologically-ordered tweet stream, we
incrementally pretrain a PTLM with different continual learning algorithms, and
keep track of the downstream task performance (after fine-tuning). We evaluate
the PTLM's ability to adapt to new corpora while retaining knowledge learned from
earlier corpora. Our experiments show distillation-based approaches to be most
effective in retaining downstream performance in earlier domains. The
algorithms also improve knowledge transfer, allowing models to achieve better
downstream performance on the latest data, and improve temporal
generalization when training and evaluation data come from different time
periods and their distributions differ. We believe our problem formulation, methods, and analysis will
inspire future studies towards continual pretraining of language models.
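
The abstract does not say which distillation variants were used. As one common form, a logit-distillation penalty can be sketched as below, under the assumption that the model from the previous corpus acts as a frozen teacher for the model being updated. The names `softmax`, `distillation_loss`, and `temperature` are hypothetical and are not the paper's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(p_old || p_new) between temperature-softened distributions.

    `old_logits` come from the model trained on earlier corpora (kept
    frozen); `new_logits` come from the model being updated on the new
    corpus. The loss is zero when both models agree and grows as the
    updated model drifts away from the earlier one.
    """
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


# Toy logits (assumed values): identical outputs versus drifted outputs.
no_drift = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drift = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In a continual pretraining loop, a term like this would typically be added to the usual language modeling loss on the new corpus, so that adapting to new data is balanced against staying close to the earlier model.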

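This study investigates the challenge of lifelong language model pretraining by incrementally pretraining a pretrained language model with different continual learning algorithms and evaluating the model's ability to adapt to new data while retaining knowledge learned from earlier data. The results show that distillation-based approaches are the most effective at retaining downstream task performance in earlier domains. These algorithms also improve knowledge transfer, allowing the model to achieve better downstream performance on the latest data, and improve temporal generalization when the distributions of training and evaluation data differ because of time.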