AbstractFar beyond learning long-range interactions of natural language,
transformers are becoming the de-facto standard for many vision tasks with their power and scalabilty. Especially with cross-modal tasks between image and text, vector quantized variational
→