Federated Learning (FL) is transforming the ML training ecosystem from a
centralized over-the-cloud setting to distributed training over edge devices in
order to strengthen data privacy. An essential but rarely studied challenge in
FL is label deficiency at the edge. This problem is more pronounced in FL than
in centralized training because FL users are often reluctant to label their
private data. Furthermore, due to the heterogeneous
nature of the data at edge devices, it is crucial to develop personalized
models. In this paper we propose self-supervised federated learning (SSFL), a
unified self-supervised and personalized federated learning framework, and a
series of algorithms under this framework which work towards addressing these
challenges. First, under the SSFL framework, we demonstrate that the standard
FedAvg algorithm is compatible with recent breakthroughs in centralized
self-supervised learning such as SimSiam networks. Moreover, to deal with data
heterogeneity at the edge devices in this framework, we develop a series
of algorithms that extend existing supervised personalization algorithms to
the setting of self-supervised learning. We further propose a novel
personalized federated self-supervised learning algorithm, Per-SSFL, which
balances personalization and consensus by carefully regulating the distance
between the local and global representations of data. To provide a
comprehensive comparative analysis of all proposed algorithms, we also develop
a distributed training system and related evaluation protocol for SSFL. Our
findings show that the gap in evaluation accuracy between supervised learning
and unsupervised learning in FL is both small and reasonable. The performance
comparison indicates that the representation regularization-based
personalization method outperforms the other variants.
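
The two ingredients named above, FedAvg aggregation and a penalty on the
distance between local and global representations, can be illustrated with a
minimal sketch. This is not the paper's implementation: the function names, the
squared Euclidean distance, and the weight `lam` are assumptions made for
illustration. The SimSiam-style term is the usual negative cosine similarity,
with the target treated as a constant since plain Python has no gradients.

```python
import math
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def negative_cosine(p: Sequence[float], z: Sequence[float]) -> float:
    """SimSiam-style similarity term: -cos(p, z), with z as a fixed target."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def regularized_local_loss(ssl_loss: float,
                           local_repr: Sequence[float],
                           global_repr: Sequence[float],
                           lam: float) -> float:
    """Self-supervised loss plus lam times the squared local/global distance.

    The squared Euclidean distance and `lam` are illustrative assumptions.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(local_repr, global_repr))
    return ssl_loss + lam * dist_sq
```

For example, `fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])` returns `[2.5, 3.5]`:
the second client holds three times as many samples, so it pulls the average
toward its own parameters. A larger `lam` favors consensus with the global
model, while a smaller one favors personalization.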

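This paper proposes self-supervised federated learning (SSFL), a framework that addresses challenges such as label deficiency and data heterogeneity, together with a series of algorithms under it, including Per-SSFL, and shows that FedAvg is compatible with SimSiam. The authors also develop a distributed training system and a related evaluation protocol, and find that the performance gap between supervised and unsupervised learning is small.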

SSFL: Tackling Label Deficiency in Federated Learning via Personalized Self-Supervision

Mixture-of-Expert (MoE) shows strong potential for scaling language models to
trillions of parameters. However, training trillion-scale MoE models
requires algorithm and system co-design for a well-tuned, high-performance
distributed training system. Unfortunately, the only existing platform that
meets the requirements strongly depends on Google's hardware (TPU) and software
(Mesh TensorFlow) stack, and is not open and available to the public,
especially to the GPU and PyTorch communities.
In this paper, we present FastMoE, a distributed MoE training system based on
PyTorch with common accelerators. The system provides a hierarchical interface
for both flexible model design and easy adaptation to different applications,
such as Transformer-XL and Megatron-LM. Unlike a direct implementation of
MoE models in PyTorch, FastMoE highly optimizes training speed with
sophisticated high-performance acceleration techniques. The system supports
placing different experts on multiple GPUs across multiple nodes, so the
number of experts can grow linearly with the number of GPUs. The source of
FastMoE is available at this https URL under the Apache-2
license.
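
To make the expert-placement idea concrete, here is a minimal sketch of top-k
gating and of an expert-to-GPU mapping in which the expert count grows linearly
with the number of GPUs. It does not use FastMoE's actual API: `top_k_gate`,
`place_experts`, and the fixed `experts_per_gpu` layout are hypothetical names
and assumptions chosen for illustration, and top-k softmax gating is the common
MoE routing scheme rather than something the abstract specifies.

```python
import math
from typing import Dict, List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def place_experts(num_gpus: int, experts_per_gpu: int) -> Dict[int, int]:
    """Map each expert id to a GPU rank (hypothetical contiguous layout).

    The total number of experts is num_gpus * experts_per_gpu, so it grows
    linearly with the number of GPUs.
    """
    return {e: e // experts_per_gpu
            for e in range(num_gpus * experts_per_gpu)}
```

With gate scores `[0.1, 2.0, -1.0, 1.0]` and `k = 2`, the token is routed to
experts 1 and 3, and `place_experts(4, 2)` tells the caller that expert 3 lives
on GPU rank 1. Doubling `num_gpus` doubles the number of experts without
changing how many each GPU holds.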

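This paper presents FastMoE, a PyTorch-based distributed Mixture-of-Expert (MoE) training system. It supports placing different experts on multiple GPUs across multiple nodes, achieves fast training through high-performance acceleration techniques, and offers flexible model design with easy adaptation to applications such as Transformer-XL and Megatron-LM.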

FastMoE: A Fast Mixture-of-Expert Training System

Word2vec is a popular family of algorithms for unsupervised training of dense
vector representations of words on large text corpora. The resulting vectors
have been shown to capture semantic relationships among their corresponding
words, and have shown promise in reducing a number of natural language
processing (NLP) tasks to mathematical operations on these vectors. While
applications of word2vec have so far centered on vocabularies with a
few million words, where the vocabulary is the set of words for which vectors
are simultaneously trained, novel applications are emerging in areas outside of
NLP with vocabularies comprising several hundred million words. Existing word2vec
training systems are impractical for training such large vocabularies as they
either require that the vectors of all vocabulary words be stored in the memory
of a single server or suffer unacceptable training latency due to massive
network data transfer. In this paper, we present a novel distributed, parallel
training system that, for the first time, makes it practical to train vectors
for vocabularies with several hundred million words on a shared cluster of
commodity servers, using far less network traffic than existing solutions. We
evaluate the proposed system on a benchmark dataset, showing that the quality
of vectors does not degrade relative to non-distributed training. Finally, the
system has been deployed for several quarters to match queries to ads in
Gemini, the sponsored search advertising platform at Yahoo, resulting in
significant improvements in business metrics.
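
The abstract does not spell out how the system avoids both single-server
storage and heavy network transfer. One way such a trade-off could be
approached, shown here purely as an illustrative assumption, is to split every
vector's dimensions across shards so that each shard computes a partial dot
product locally and only a scalar per shard crosses the network. The `Shard`
class and helper names below are hypothetical.

```python
from typing import Dict, List


class Shard:
    """Holds one contiguous slice of the dimensions of every word vector."""

    def __init__(self, vectors: Dict[str, List[float]],
                 start: int, stop: int) -> None:
        self.slices = {word: vec[start:stop] for word, vec in vectors.items()}

    def partial_dot(self, a: str, b: str) -> float:
        """Dot product restricted to this shard's slice of dimensions."""
        return sum(x * y for x, y in zip(self.slices[a], self.slices[b]))


def make_shards(vectors: Dict[str, List[float]],
                num_shards: int) -> List[Shard]:
    """Split dimensions into at most num_shards contiguous slices."""
    dim = len(next(iter(vectors.values())))
    step = -(-dim // num_shards)  # ceiling division
    return [Shard(vectors, s, min(s + step, dim))
            for s in range(0, dim, step)]


def distributed_dot(shards: List[Shard], a: str, b: str) -> float:
    """Sum per-shard partial results; each shard contributes one scalar."""
    return sum(shard.partial_dot(a, b) for shard in shards)
```

With `vectors = {"query": [1.0, 2.0, 3.0, 4.0], "ad": [0.5, 0.0, -1.0, 2.0]}`
split into two shards, `distributed_dot(shards, "query", "ad")` returns `5.5`,
the same value as the full dot product, while no shard ever holds a complete
vector.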

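This paper presents a novel distributed, parallel training system for word2vec that can train vector representations for vocabularies of several hundred million words without requiring massive network data transfer or storing all vectors on a single server. Deployed in Gemini, Yahoo's sponsored search advertising platform, it has delivered a significant improvement in business metrics.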