As parameter counts grow rapidly, deploying large generative models becomes
increasingly challenging because they typically require large amounts of GPU
memory and massive computation. Unstructured model pruning has been a
common approach to reduce both GPU memory footprint and the overall computation
while retaining good model accuracy. However, existing solutions do not
provide highly efficient support for unstructured sparsity on modern GPUs,
especially on the highly structured Tensor Core hardware. Therefore, we
propose Flash-LLM, which enables low-cost and highly efficient large generative
model inference with sophisticated support for unstructured sparsity on
high-performance but highly restrictive Tensor Cores. Based on our key
observation that the main bottleneck of generative model inference is a set of
skinny matrix multiplications, for which Tensor Cores are significantly
under-utilized due to low computational intensity, we propose a
general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse
matrix multiplication. The basic insight is to address the significant memory
bandwidth bottleneck while tolerating redundant computations that are not
critical for end-to-end performance on Tensor Cores. Based on this, we design
an effective software framework for Tensor Core based unstructured SpMM,
leveraging on-chip resources for efficient sparse data extraction and
computation/memory-access overlapping. At the SpMM kernel level, Flash-LLM
significantly outperforms the state-of-the-art libraries Sputnik and SparTA by
an average of 2.9x and 1.5x, respectively. At the end-to-end framework level on
OPT-30B/66B/175B models, Flash-LLM achieves up to 3.8x and 3.6x higher
throughput in tokens per GPU-second than DeepSpeed and FasterTransformer,
respectively, with significantly lower inference cost.
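
The Load-as-Sparse and Compute-as-Dense idea can be illustrated with a small, CPU-only sketch. This is not the Flash-LLM kernel: the function names, the (row, col, value) storage format, and the plain Python lists standing in for global memory, on-chip buffers, and Tensor Core operations are all assumptions made for illustration. The sketch reads only the nonzeros of the pruned weight matrix, expands them into a dense tile, and then runs an ordinary dense multiplication against a skinny activation matrix, accepting the redundant multiplications by zero.

```python
def to_sparse(dense):
    """Compress a dense matrix into (row, col, value) triples for its nonzeros.

    Hypothetical storage format, chosen only to keep the sketch short.
    """
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]


def load_as_sparse_compute_as_dense(nonzeros, shape, b):
    """Multiply a sparse matrix (given as triples) by a dense matrix b."""
    rows, cols = shape

    # Load-as-Sparse: only the nonzeros are read from the compressed store,
    # then scattered into a dense tile (a stand-in for an on-chip buffer).
    tile = [[0.0] * cols for _ in range(rows)]
    for r, c, v in nonzeros:
        tile[r][c] = v

    # Compute-as-Dense: a regular dense product over the tile, zeros included.
    n = len(b[0])
    return [[sum(tile[i][k] * b[k][j] for k in range(cols)) for j in range(n)]
            for i in range(rows)]


def dense_matmul(a, b):
    """Reference dense product used to check the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


# A pruned 3x4 weight matrix and a skinny 4x2 activation matrix.
A = [[0, 2, 0, 0],
     [1, 0, 0, 3],
     [0, 0, 0, 0]]
B = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]

A_sparse = to_sparse(A)
C = load_as_sparse_compute_as_dense(A_sparse, (3, 4), B)
```

In this toy case only 3 of the 12 weight entries are loaded, while the multiplication still performs the full dense amount of work. That trade is the one the abstract describes: memory traffic is reduced, and the redundant computation is tolerated because the skinny multiplication is bound by memory bandwidth rather than by compute.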

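Flash-LLM is a low-cost, highly efficient inference framework for large generative models. By optimizing sparse matrix multiplication, it achieves significant performance gains on high-performance Tensor Cores.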