Egocentric temporal action segmentation in videos is an important computer
vision task with applications in mixed reality, human behavior analysis, and
robotics. Although recent research has utilized
advanced visual-language frameworks, transformers remain the backbone of action
segmentation models. Therefore, it is necessary to improve transformers to
enhance the robustness of action segmentation models. In this work, we propose
two novel ideas to enhance the state-of-the-art transformer for action
segmentation. First, we introduce a dual dilated attention mechanism to
adaptively capture hierarchical representations in both local-to-global and
global-to-local contexts. Second, we incorporate cross-connections between the
encoder and decoder blocks to prevent the loss of local context by the decoder.
Additionally, we utilize state-of-the-art visual-language representation
learning techniques to extract richer and more compact features for our
transformer. Our proposed approach outperforms other state-of-the-art methods
on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools
datasets, and we validate our introduced components with ablation studies. The
source code and supplementary materials are publicly available.
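
To make the dual dilated idea concrete, the following is a minimal sketch of which frames a single frame could attend to when one branch widens its dilation with depth (local-to-global) and the other narrows it (global-to-local). The abstract does not state the dilation rates or window sizes, so the power-of-two schedule, the window half-width, and all function names here are illustrative assumptions rather than the paper's implementation.

```python
def dual_dilation_schedule(num_layers):
    """Return one (local_to_global, global_to_local) dilation pair per layer.

    Assumption: rates double with depth in one branch and halve in the other.
    """
    return [(2 ** i, 2 ** (num_layers - 1 - i)) for i in range(num_layers)]


def dilated_window(t, dilation, length, half_width=1):
    """Frame indices around t spaced by `dilation`, clipped to [0, length)."""
    idx = [t + k * dilation for k in range(-half_width, half_width + 1)]
    return [i for i in idx if 0 <= i < length]


def dual_dilated_context(t, layer, num_layers, length):
    """Union of the two dilated windows that frame t covers in a given layer."""
    grow, shrink = dual_dilation_schedule(num_layers)[layer]
    near = dilated_window(t, grow, length)
    far = dilated_window(t, shrink, length)
    return sorted(set(near) | set(far))


if __name__ == "__main__":
    # Hypothetical setting: 4 layers over a 32-frame clip, querying frame 10.
    for layer in range(4):
        print(layer, dual_dilated_context(10, layer, 4, 32))
```

In this sketch every layer sees both a fine and a coarse temporal neighborhood at once, which is one plausible reading of combining local-to-global and global-to-local contexts. An actual attention layer would compute weighted sums over these index sets.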
