The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and hardware that accelerates sparse computation is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that make use of the hardware support offered by specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
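
To make the N:M constraint concrete, the following is a minimal sketch in plain Python of how keeping N values out of every M consecutive weights fixes the sparsity ratio at 1 - N/M. It is not the paper's method: it selects values by magnitude rather than with the second-order technique, it ignores the V (vector) dimension of V:N:M entirely, and the names `prune_n_m` and `sparsity` are hypothetical helpers introduced only for illustration.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive values.

    Assumption: magnitude-based selection, used here only to illustrate the
    N:M pattern. The abstract describes a second-order pruning technique instead.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        # Zero out everything that was not selected.
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def sparsity(row):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in row if v == 0) / len(row)


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]

two_four = prune_n_m(weights, n=2, m=4)   # the pattern SPTCs support natively
two_eight = prune_n_m(weights, n=2, m=8)  # a sparser N:M ratio

print(two_four, sparsity(two_four))
print(two_eight, sparsity(two_eight))
```

With 2:4 exactly half of the entries survive, which is the 50% ceiling mentioned above; a ratio such as 2:8 leaves only a quarter of them, which is the kind of pattern the V:N:M format is meant to map onto SPTCs.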

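The growing success and scale of deep learning models place higher demands on computational efficiency and power. Sparsification yields smaller models and higher compute efficiency, and hardware that accelerates it is already becoming available. This paper proposes a new format, V:N:M, for executing sparse computation with arbitrary N:M ratios on NVIDIA's Sparse Tensor Cores. Its high-performance sparse library, Spatha, achieves up to 37x speedup, and a second-order pruning technique reaches high sparsity in modern transformers with almost no loss in accuracy.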

VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores