object tracking can be formulated as "finding the right object in a video".
We observe that recent approaches for class-agnostic tracking tend to focus on
the "finding" part, but largely overlook the "object" part of the task,
essentially doing a template matching over a frame in a sli