Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
TL;DR: A transformer-based multimodal model that leverages temporal context to improve recognition in egocentric videos.
Abstract
In egocentric videos, actions occur in quick succession. We capitalise on the
action's temporal context and propose a method that learns to attend to
surrounding actions in order to improve recognition performance.