Establishing correspondence between images or scenes is a significant
challenge in computer vision, especially given occlusions, viewpoint changes,
and varying object appearances. In this paper, we present Siamese Masked
Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for
learning visual correspondence from videos. SiamMAE operates on pairs of
randomly sampled video frames and asymmetrically masks them. These frames are
processed independently by an encoder network, and a decoder composed of a
sequence of cross-attention layers is tasked with predicting the missing
patches in the future frame. By masking a large fraction ($95\%$) of patches in
the future frame while leaving the past frame unchanged, SiamMAE encourages the
network to focus on object motion and learn object-centric representations.
Despite its conceptual simplicity, features learned via SiamMAE outperform
state-of-the-art self-supervised methods on video object segmentation, pose
keypoint propagation, and semantic part propagation tasks. SiamMAE achieves
competitive results without relying on data augmentation, handcrafted
tracking-based pretext tasks, or other techniques to prevent representational
collapse.
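
The asymmetric masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `asymmetric_masks`, the 196-patch grid (a 14 x 14 layout), and the use of uniform random sampling are assumptions made for illustration. Only the idea of hiding about 95% of future-frame patches while leaving the past frame fully visible comes from the abstract.

```python
import random


def asymmetric_masks(num_patches, future_mask_ratio=0.95, seed=None):
    """Return (past_visible, future_visible, future_masked) patch indices.

    Hypothetical helper: the past frame keeps every patch, while a random
    subset of the future frame's patches is hidden from the encoder.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * future_mask_ratio))

    # Past frame is left unchanged: all patches are visible.
    past_visible = list(range(num_patches))

    # Future frame: shuffle patch indices and hide the first `num_masked`.
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    future_masked = sorted(shuffled[:num_masked])
    future_visible = sorted(shuffled[num_masked:])
    return past_visible, future_visible, future_masked


# Assumed patch count: 196 patches (14 x 14 grid), chosen only for the example.
past, visible, masked = asymmetric_masks(196, seed=0)
print(len(past), len(visible), len(masked))  # prints "196 10 186"
```

With so few future-frame patches visible, a decoder has to rely on the fully visible past frame to reconstruct the rest, which is the intuition the abstract gives for why the network attends to object motion.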

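This paper proposes Siamese Masked Autoencoders (SiamMAE), a method that learns visual correspondence from videos. By masking a large fraction of patches, it encourages the network to focus on object motion and learn object-centric representations. Without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse, it outperforms prior self-supervised methods.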

Siamese Masked Autoencoders

Most work on temporal action detection is formulated as an offline problem,
in which the start and end times of actions are determined after the entire
video is fully observed. However, important real-time applications including
surveillance and driver assistance systems require identifying actions as soon
as each video frame arrives, based only on current and historical observations.
In this paper, we propose a novel framework, Temporal Recurrent Network (TRN),
to model greater temporal context of a video frame by simultaneously performing
online action detection and anticipation of the immediate future. At each
moment in time, our approach makes use of both accumulated historical evidence
and predicted future information to better recognize the action that is
currently occurring, and integrates both of these into a unified end-to-end
architecture. We evaluate our approach on two popular online action detection
datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS'14.
The results show that TRN significantly outperforms the state of the art.
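
The online setting described above, where each frame is labeled on arrival from past evidence plus an anticipated future, can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the TRN architecture: `online_detect`, `persistence_anticipate`, `ema_update`, and `threshold_classify` are hypothetical names, frames are reduced to single scores, and the learned recurrent components are replaced by hand-written stand-ins.

```python
def online_detect(frames, anticipate, update_state, classify, horizon=3):
    """Label each frame as it arrives, using only current and past frames."""
    state = None
    labels = []
    for frame in frames:  # frames are consumed strictly in arrival order
        # Predict features for the next `horizon` steps (never observed).
        future = anticipate(state, frame, horizon)
        # Fold the current frame and the predicted future into the history.
        state = update_state(state, frame, future)
        labels.append(classify(state))
    return labels


def persistence_anticipate(state, frame, horizon):
    # Toy forecast (assumption): the near future looks like the current frame.
    return [frame] * horizon


def ema_update(state, frame, future, alpha=0.5):
    # Combine current evidence with the mean predicted future, then smooth
    # it into the accumulated history with an exponential moving average.
    step = (frame + sum(future) / len(future)) / 2.0
    return step if state is None else alpha * step + (1 - alpha) * state


def threshold_classify(state):
    return "action" if state > 0.5 else "background"


frames = [0.0, 0.0, 1.0, 1.0, 0.0]  # made-up per-frame action scores
labels = online_detect(frames, persistence_anticipate, ema_update,
                       threshold_classify)
print(labels)
```

Because the loop never looks past the current frame, the labels for any prefix of the video equal the prefix of the labels for the full video, which is the property that separates online detection from the offline formulation.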

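This paper proposes a novel framework, Temporal Recurrent Network (TRN), that models the temporal context of a video frame by performing online action detection while anticipating the immediate future. It recognizes the current action by combining accumulated historical evidence with predicted future information, and evaluation on three datasets (HDD, TVSeries, and THUMOS'14) shows that TRN significantly outperforms the state of the art.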