Training a classifier over a large number of classes, known as 'extreme
classification', has become a topic of major interest with applications in
technology, science, and e-commerce. Traditional softmax regression induces a
gradient cost proportional to the number of classes $C$, whic