We observe that adding a shared layer to a mixture-of-experts model can
degrade performance. This leads us to hypothesize that learning shared
features is difficult in deep learning, potentially because the same
feature is learned redundantly as several different features. To address this issue, we
track each expert's usage frequency and merge the two most frequently selected
experts. We then update the least frequently selected expert with the
resulting combination. This approach, combined with the subsequent learning of
the router's expert selection, allows the model to determine whether the most
frequently selected experts have learned the same feature differently. If they
have, the combined expert can be further trained to learn a more general
feature. Consequently, our algorithm enhances transfer learning and mitigates
catastrophic forgetting when applied to multi-domain task-incremental learning.
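
The usage-tracking and merging step described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: each expert is represented by a plain weight array, the two most frequently selected experts are combined by an unweighted parameter average, and the least frequently selected expert is overwritten with the result. The function names `count_usage` and `merge_and_recycle` are hypothetical and chosen only for illustration.

```python
import numpy as np


def count_usage(router_choices, num_experts):
    """Count how often the router selected each expert.

    router_choices: 1-D integer array of selected expert indices.
    """
    return np.bincount(router_choices, minlength=num_experts)


def merge_and_recycle(expert_weights, usage_counts):
    """Merge the two most-used experts and write the result into the
    least-used expert's slot.

    expert_weights: array of shape (num_experts, ...), one slice per expert.
    usage_counts:   array of shape (num_experts,).

    Assumption: "merging" is an unweighted average of parameters. The
    source text does not specify the combination rule.
    """
    counts = np.asarray(usage_counts)
    if counts.shape[0] < 3:
        raise ValueError("need at least three experts to merge two into a third")

    # Ascending order of usage; a stable sort makes tie-breaking deterministic.
    order = np.argsort(counts, kind="stable")
    least = int(order[0])
    top_a, top_b = int(order[-1]), int(order[-2])

    merged = 0.5 * (expert_weights[top_a] + expert_weights[top_b])

    # Return a copy so the caller's original weights are left untouched.
    updated = expert_weights.copy()
    updated[least] = merged
    return updated, (top_a, top_b, least)


# Toy usage: four experts, each with a two-element weight vector.
choices = np.array([0, 2, 2, 1, 2, 0, 3, 0, 2, 3])
weights = np.array([[0.0, 2.0], [10.0, 10.0], [4.0, 6.0], [7.0, 7.0]])
counts = count_usage(choices, num_experts=4)
new_weights, (top_a, top_b, least) = merge_and_recycle(weights, counts)
```

In this sketch the two source experts are kept unchanged, so continued training of the router can reveal whether it shifts selections toward the merged expert, which is the signal the abstract describes.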
