The human education system trains one student with multiple experts.
Analogously, mixture-of-experts (MoE) is a powerful sparse architecture comprising
multiple experts. However, sparse MoE models are prone to overfitting, hard to deploy, and not
hardware-friendly for practitioners. In this work, inspired by the