Locating an object in a sequence of frames, given its appearance in the first
frame of the sequence, is a hard problem that involves many stages.
State-of-the-art methods usually focus on introducing novel ideas in the visual
encoding or relational modelling phases. In this work, however, we show that bounding
box regression from learned joint search and templ