Recently, the Mixture-of-Experts (MoE) architecture has achieved
remarkable success in increasing the model capacity of large-scale language
models. However, MoE requires incorporating significantly more parameters than
the base model being extended. In this paper, we propose building a
parameter-efficient MoE architecture by sharing information among experts. We
adopt the matrix product operator (MPO), a tensor decomposition from quantum
many-body physics, to reconstruct the parameter matrix in the expert layer from
a central tensor, which contains the core information, and auxiliary tensors,
which complement the central tensor. Sharing the parameters of the central
tensor among different experts increases model capacity for pre-trained
language models, while the auxiliary tensors of each expert preserve its
specificity. To address the unbalanced optimization issue, we further design a
gradient mask strategy for the MPO-based MoE
architecture. Extensive experiments based on T5 and GPT-2 show improved
performance and efficiency of the pre-trained language model (a 27.2x reduction
in total parameters with superior model performance, compared with
Switch Transformers). Our code is publicly available at
this https URL.

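This paper proposes a parameter-efficient Mixture-of-Experts architecture. It
shares the central tensor of the parameter matrix among the experts in the
expert layer and uses auxiliary tensors to give each expert its specificity,
relying on the matrix product operator, a tensor decomposition inspired by
quantum many-body physics, to address the parameter inflation of
Mixture-of-Experts architectures. Experimental results show that the new method
achieves better performance and efficiency.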
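
The parameter-sharing idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a three-tensor MPO (one central tensor, two auxiliary tensors), and the factor sizes, bond dimension, number of experts, and helper names (`build_expert_weight`, `expert_forward`) are hypothetical choices made only for illustration. Routing and the gradient mask strategy are omitted.

```python
import numpy as np


def build_expert_weight(aux_left, central, aux_right):
    """Contract three local tensors into a dense (in_dim, out_dim) matrix."""
    # i, j, k: factorized input indices; a, b, c: factorized output indices;
    # x, y: bond indices linking neighbouring tensors.
    w = np.einsum("iax,xjby,ykc->ijkabc", aux_left, central, aux_right)
    i, j, k, a, b, c = w.shape
    return w.reshape(i * j * k, a * b * c)


rng = np.random.default_rng(0)

# Hypothetical sizes: a 64 x 64 expert matrix factorized as (2*16*2) x (2*16*2).
in_factors, out_factors, bond = (2, 16, 2), (2, 16, 2), 4
num_experts = 8

# One central tensor shared by every expert.
central = rng.standard_normal((bond, in_factors[1], out_factors[1], bond))

# Each expert owns only its two small auxiliary tensors.
experts = [
    (
        rng.standard_normal((in_factors[0], out_factors[0], bond)),
        rng.standard_normal((bond, in_factors[2], out_factors[2])),
    )
    for _ in range(num_experts)
]


def expert_forward(x, expert_id):
    """Apply one expert: rebuild its weight from shared and private tensors."""
    left, right = experts[expert_id]
    return x @ build_expert_weight(left, central, right)


x = rng.standard_normal((3, 64))
y = expert_forward(x, expert_id=0)

# Parameter counts: independent dense experts vs. shared central tensor.
dense_params = num_experts * 64 * 64
shared_params = central.size + sum(l.size + r.size for l, r in experts)
print(y.shape, dense_params, shared_params)
```

In this toy configuration the central tensor holds almost all of the parameters, so adding an expert costs only the two auxiliary tensors. The sizes are arbitrary and the resulting counts are not meant to reproduce the 27.2x figure reported in the abstract.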