The quadratic computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. To address these challenges, this paper introduces a