Mixture of Experts (MoE) has proven effective at increasing the capacity of
language models by dynamically routing each input token to a specific subset of
experts for processing. Despite this success, most existing methods struggle to
balance sparsity against the availability of expert knowledge: improving
performance by drawing on more expert knowledge usually reduces sparsity during
expert selection. To resolve this tension, we propose HyperMoE, a novel MoE
framework built upon Hypernetworks. The framework integrates the computational
processes of MoE with the concept of knowledge transfer in multi-task learning.
Specific modules generated from the information of unselected experts serve as
supplementary information, which allows the knowledge of unselected experts to
be used while maintaining selection sparsity. Our comprehensive empirical
evaluations across multiple datasets and backbones show that HyperMoE
significantly outperforms existing MoE methods under identical conditions with
respect to the number of experts.
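
The idea of keeping top-k selection sparse while still drawing on unselected
experts can be sketched in a few lines. This is a minimal illustration under
stated assumptions, not the HyperMoE implementation: the per-expert embeddings
(`expert_embs`), the linear hypernetwork (`hyper_W`), and the choice of a
per-dimension scaling vector as the "generated module" are all hypothetical
stand-ins introduced here for illustration.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def top_k_route(router_logits, k):
    """Return {expert index: renormalised gate weight} for the top-k experts."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = order[:k]
    norm = sum(probs[i] for i in selected)
    return {i: probs[i] / norm for i in selected}


def hyper_moe_token(x, experts, expert_embs, router_W, hyper_W, k=1):
    """Process one token: sparse expert mixture plus a hypernetwork-generated term."""
    gates = top_k_route(matvec(router_W, x), k)

    # Sparse part: only the k selected experts are evaluated.
    y = [0.0] * len(x)
    for i, g in gates.items():
        out = matvec(experts[i], x)
        y = [a + g * b for a, b in zip(y, out)]

    # Supplementary part (assumption): summarise the unselected experts by the
    # mean of their embeddings and let a linear hypernetwork turn that summary
    # into a small generated module, here a per-dimension scaling of x.
    unselected = [i for i in range(len(experts)) if i not in gates]
    if unselected:
        dim = len(expert_embs[0])
        mean_emb = [
            sum(expert_embs[i][j] for i in unselected) / len(unselected)
            for j in range(dim)
        ]
        scale = matvec(hyper_W, mean_emb)
        y = [a + s * xi for a, s, xi in zip(y, scale, x)]
    return y, sorted(gates)


random.seed(0)
d, e_dim, n_experts = 3, 2, 4
experts = [rand_matrix(d, d) for _ in range(n_experts)]
expert_embs = rand_matrix(n_experts, e_dim)
router_W = rand_matrix(n_experts, d)
hyper_W = rand_matrix(d, e_dim)
x = [0.5, -1.0, 2.0]
y, selected = hyper_moe_token(x, experts, expert_embs, router_W, hyper_W, k=1)
print(selected, y)
```

Only `k` expert matrices are multiplied per token; the unselected experts
contribute through their embeddings alone, so selection sparsity is unchanged
in this sketch.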

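HyperMoE is a novel Mixture of Experts (MoE) framework built upon Hypernetworks. It uses specific modules generated from unselected experts as supplementary information, so that the knowledge of unselected experts can be used while selection sparsity is maintained, and it significantly outperforms existing MoE methods under identical conditions.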

HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

Image-text pretrained models, e.g., CLIP, have shown impressive general
multi-modal knowledge learned from large-scale image-text data pairs, thus
attracting increasing attention for their potential to improve visual
representation learning in the video domain. In this paper, based on the CLIP
model, we revisit temporal modeling in the context of image-to-video knowledge
transferring, which is the key point for extending image-text pretrained models
to the video domain. We find that current temporal modeling mechanisms are
tailored to either high-level semantic-dominant tasks (e.g., retrieval) or
low-level visual pattern-dominant tasks (e.g., recognition), and fail to work
well on both simultaneously. The key difficulty lies in modeling temporal
dependency while taking advantage of both high-level and low-level knowledge in
the CLIP model. To tackle this problem, we present the Spatial-Temporal
Auxiliary Network (STAN), a simple and effective temporal modeling mechanism
that extends the CLIP model to diverse video tasks. Specifically, to realize both low-level and
high-level knowledge transferring, STAN adopts a branch structure with
decomposed spatial-temporal modules that enable multi-level CLIP features to be
spatial-temporally contextualized. We evaluate our method on two representative
video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments
demonstrate the superiority of our model over the state-of-the-art methods on
various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and
Something-Something-V2. Code will be made available.
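
As a rough illustration of what "decomposed spatial-temporal modules" can mean,
the sketch below applies attention within each frame first and then across
frames at each patch position. It is a generic, parameter-free divided
space-time attention written under assumptions, not the STAN architecture: the
function names are hypothetical, there are no learned projections, and the
branch structure over multi-level CLIP features is omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return out


def decomposed_spatial_temporal(video):
    """video[t][n] is the feature vector of patch n in frame t."""
    num_frames, num_patches = len(video), len(video[0])

    # Spatial step: patches attend to each other within the same frame.
    spatial = [self_attention(frame) for frame in video]

    # Temporal step: each patch position attends across frames.
    out = [[None] * num_patches for _ in range(num_frames)]
    for n in range(num_patches):
        track = self_attention([spatial[t][n] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][n] = track[t]
    return out


# Two frames, three patches per frame, two feature dimensions.
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
]
contextualized = decomposed_spatial_temporal(video)
print(len(contextualized), len(contextualized[0]), len(contextualized[0][0]))
```

Splitting the attention this way keeps each step over a short sequence
(patches in one frame, or frames at one position) instead of attending over
all patches of all frames at once.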

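Based on the CLIP model, this paper proposes a spatial-temporal modeling mechanism called STAN for extending image-text pretrained models to the video domain, and demonstrates its superiority on multiple tasks, including video-text retrieval and video recognition.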

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Non-autoregressive automatic speech recognition (ASR) modeling has received
increasing attention recently because of its fast decoding speed and superior
performance. Among representative approaches, methods based on connectionist
temporal classification (CTC) remain the dominant stream. However, their
theoretically inherent flaw, the assumption of independence between tokens,
creates a performance barrier for this line of work. To address this challenge,
we propose a context-aware knowledge transferring strategy, consisting of a
knowledge transferring module and a context-aware training strategy, for
CTC-based ASR. The former is designed to distill linguistic information from a
pre-trained language model, and the latter is designed to mitigate the
limitations caused by the conditional independence assumption. As a result, a
knowledge-injected, context-aware CTC-based ASR model built upon wav2vec2.0 is
presented in this paper. A series of experiments on the AISHELL-1 and AISHELL-2
datasets demonstrates the effectiveness of the proposed method.
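
The independence assumption mentioned above is visible in how CTC scores a
transcript: the probability of each frame-level alignment is a plain product of
per-frame probabilities, and the transcript probability is the sum over all
alignments that collapse to it. The brute-force sketch below illustrates this
on a toy example; the blank symbol `_`, the function names, and the
probabilities are made up for illustration, and real systems use dynamic
programming rather than enumeration.

```python
from itertools import product

BLANK = "_"


def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_prob(frame_probs, target):
    """Sum the probability of every alignment that collapses to `target`.

    Each alignment's probability is the product of per-frame probabilities,
    i.e. frames are treated as conditionally independent given the audio.
    """
    vocab = list(frame_probs[0].keys())
    total = 0.0
    for path in product(vocab, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]
            total += p
    return total


# Toy posteriors for two frames over the vocabulary {a, blank}.
frame_probs = [
    {"a": 0.6, BLANK: 0.4},
    {"a": 0.5, BLANK: 0.5},
]
p_a = ctc_prob(frame_probs, "a")
print(p_a)
```

Because no term in the product depends on what was emitted at other frames,
the model cannot express dependencies between output tokens on its own, which
is the limitation that the context-aware strategy targets.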

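This study uses a context-aware knowledge transferring strategy to inject linguistic information into CTC-based automatic speech recognition models, improving their performance; experiments demonstrate the effectiveness of the method on the AISHELL-1 and AISHELL-2 datasets.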

A context-aware knowledge transferring strategy for CTC-based ASR

Networks trained on a long-tailed dataset vary remarkably despite identical
training settings, which reveals the great uncertainty in long-tailed
learning. To alleviate this uncertainty, we propose Nested Collaborative
Learning (NCL), which tackles the problem by collaboratively learning multiple
experts together. NCL consists of two core components, namely Nested Individual
Learning (NIL) and Nested Balanced Online Distillation (NBOD), which focus on
the individual supervised learning of each single expert and the knowledge
transfer among multiple experts, respectively. To learn representations
more thoroughly, both NIL and NBOD are formulated in a nested way, in which
learning is conducted not only on all categories from a full perspective but
also on some hard categories from a partial perspective. For learning in the
partial perspective, we select the negative categories with high predicted
scores as the hard categories using the proposed Hard Category Mining (HCM).
In NCL, the learning from the two perspectives is nested, highly related and
complementary, and helps the network capture not only global and robust
features but also a meticulous distinguishing ability. Moreover,
self-supervision is further utilized for feature enhancement. Extensive
experiments demonstrate the superiority of our method, which outperforms the
state of the art whether using a single model or an ensemble.
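
The hard-category selection described above, picking the negative categories
with the highest predicted scores, can be sketched as follows. This is an
illustrative reading rather than the authors' code: the function names are
hypothetical, and combining a full-view and a partial-view cross-entropy by
simple addition, with the partial view restricted to the true category plus
the mined hard negatives, is an assumption made for this example.

```python
import math


def hard_category_mining(logits, label, num_hard):
    """Return the indices of the num_hard highest-scoring negative categories."""
    negatives = [i for i in range(len(logits)) if i != label]
    negatives.sort(key=lambda i: logits[i], reverse=True)
    return negatives[:num_hard]


def cross_entropy(logits, label, categories=None):
    """Softmax cross-entropy, optionally restricted to a subset of categories."""
    if categories is None:
        categories = list(range(len(logits)))
    m = max(logits[i] for i in categories)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in categories))
    return log_z - logits[label]


def nested_loss(logits, label, num_hard):
    """Full-perspective loss plus partial-perspective loss (assumed unweighted sum)."""
    hard = hard_category_mining(logits, label, num_hard)
    full = cross_entropy(logits, label)
    partial = cross_entropy(logits, label, [label] + hard)
    return full + partial


logits = [2.0, 0.5, 1.5, -1.0]
label = 0
hard = hard_category_mining(logits, label, num_hard=2)
loss = nested_loss(logits, label, num_hard=2)
print(hard, loss)
```

In this sketch the partial view concentrates the normalisation on the
categories most easily confused with the true one, while the full view still
covers every category.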

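This paper proposes Nested Collaborative Learning (NCL), which uses two core components, Nested Individual Learning (NIL) and Nested Balanced Online Distillation (NBOD), and learns from both a full perspective and a partial, hard-category perspective to address the uncertainty caused by class imbalance, achieving state-of-the-art performance in this area.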