In this paper, we present a new tracking architecture with an encoder-decoder
transformer as the key component. The encoder models the global spatio-temporal
feature dependencies between target objects and search regions, while the
decoder learns a query embedding to predict the spatia