In this paper, we propose a highly parameter-efficient approach to scaling
pre-trained language models (PLMs) to greater depth. Unlike prior work
that shares all parameters or uses extra blocks, we design a more capable
parameter-sharing architecture based on the matrix product operator (MPO). MPO
decomposition can reorganize and factorize the information of a parameter
matrix into two parts: a central tensor, which holds most of the information,
and auxiliary tensors, which contain only a small proportion of the
parameters. Based on such a decomposition, our architecture
shares the central tensor across all layers to reduce the model size while
keeping layer-specific auxiliary tensors (along with adapters) to enhance
adaptation flexibility. To improve model training, we further
propose a stable initialization algorithm tailored for the MPO-based
architecture. Extensive experiments demonstrate the effectiveness of the
proposed model in reducing model size and achieving highly competitive
performance.

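This paper proposes a highly parameter-efficient method based on MPO decomposition for scaling pre-trained language models (PLMs) to greater depth; by combining shared major information with retained layer-specific auxiliary information, it reduces model size and improves performance.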
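
The decomposition and sharing scheme described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a three-tensor MPO obtained by sequential SVDs without truncation, and the function names (`mpo_decompose`, `mpo_reconstruct`), the factor shapes, and the toy 16 x 16 weight matrix are all hypothetical choices made for illustration. Adapters and the proposed initialization algorithm are not modeled.

```python
import numpy as np

def mpo_decompose(W, in_factors, out_factors):
    """Factor W into a chain of 4-way tensors via sequential SVDs (no truncation)."""
    n = len(in_factors)
    assert W.shape == (int(np.prod(in_factors)), int(np.prod(out_factors)))
    # Reshape to (i1..in, j1..jn), then interleave the axes to (i1, j1, i2, j2, ...).
    T = W.reshape(*in_factors, *out_factors)
    perm = [k for pair in zip(range(n), range(n, 2 * n)) for k in pair]
    T = T.transpose(perm)
    tensors, bond = [], 1
    for k in range(n - 1):
        T = T.reshape(bond * in_factors[k] * out_factors[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        new_bond = S.shape[0]
        tensors.append(U.reshape(bond, in_factors[k], out_factors[k], new_bond))
        T = S[:, None] * Vt  # carry the remaining factor to the next step
        bond = new_bond
    tensors.append(T.reshape(bond, in_factors[-1], out_factors[-1], 1))
    return tensors

def mpo_reconstruct(tensors):
    """Contract the chain of tensors back into a 2-D matrix."""
    n = len(tensors)
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=([-1], [0]))
    out = out.squeeze(axis=(0, -1))  # drop the two dummy boundary bonds
    # Undo the interleaving: all input axes first, then all output axes.
    out = out.transpose(list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2)))
    rows = int(np.prod(out.shape[:n]))
    return out.reshape(rows, -1)

# Toy weight matrix (assumption: 16 x 16, factored as 2*4*2 on each side).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
tensors = mpo_decompose(W, (2, 4, 2), (2, 4, 2))
aux_left, central, aux_right = tensors

# Sharing sketch: one central tensor for all layers, auxiliary tensors per layer.
num_layers = 4
layer_aux = [(aux_left.copy(), aux_right.copy()) for _ in range(num_layers)]
layer_weights = [mpo_reconstruct([a_l, central, a_r]) for a_l, a_r in layer_aux]

shared_params = central.size + num_layers * (aux_left.size + aux_right.size)
unshared_params = num_layers * W.size
print(central.shape, aux_left.shape, aux_right.shape)
print(shared_params, unshared_params)
```

In this toy setting the middle tensor carries most of the parameters, so sharing it across layers removes most of the per-layer cost, while the small per-layer auxiliary tensors remain free to differ once they are trained. Here they start as identical copies purely to keep the example short.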