Large language models (LLMs) demonstrate outstanding performance across a wide range of machine learning tasks and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference is challenging because of the high compute and memory requirements that stem from the enormous model size, and because such models are difficult to run on integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of the decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with only a minimal extension to commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance than state-of-the-art methods while also being significantly less intrusive to existing accelerators.
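To make the power-of-two idea concrete, the following is a minimal Python sketch and not the paper's actual algorithm or hardware design. It assumes a simple grouping rule (each activation channel goes to the smallest group whose range covers it), a hypothetical function name `decomposed_matvec`, a hypothetical `num_groups` parameter, and weights that are already integer-quantized with a single scale `w_scale`. Because adjacent group scales differ by exactly a factor of two, the integer accumulator can be rescaled with a one-bit left shift between groups instead of being dequantized and quantized again.

```python
def quantize(value, scale, bits=8):
    """Symmetric integer quantization of one value (illustrative helper)."""
    qmax = 2 ** (bits - 1) - 1
    q = round(value / scale)
    return max(-qmax, min(qmax, q))


def decomposed_matvec(x, W, w_scale, num_groups=4, bits=8):
    """Multiply a float activation row x (length K) by integer weights W (K x N).

    Channels of x are split into groups whose scale factors are powers of
    two apart: s_g = base * 2**g. The grouping rule is an assumption made
    for this sketch.
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in x) or 1.0  # avoid a zero scale
    top = num_groups - 1
    base = absmax / qmax / (2 ** top)       # scale of the smallest group

    # Assign each channel to the smallest group whose range covers it.
    groups = [[] for _ in range(num_groups)]
    for k, v in enumerate(x):
        g = 0
        while g < top and abs(v) > qmax * base * 2 ** g:
            g += 1
        groups[g].append(k)

    n = len(W[0])
    acc = [0] * n
    # Walk from the largest scale down to the smallest one.
    for g in range(top, -1, -1):
        # One-bit shift converts the accumulator from units of s_{g+1}
        # to units of s_g, so no explicit requantization is needed.
        acc = [a << 1 for a in acc]
        s_g = base * 2 ** g
        for k in groups[g]:
            qx = quantize(x[k], s_g, bits)
            for j in range(n):
                acc[j] += qx * W[k][j]

    # A single dequantization at the very end.
    return [a * base * w_scale for a in acc]
```

For example, calling `decomposed_matvec([0.1, -0.2, 8.0, 0.05], W, 0.5)` places the outlier channel `8.0` in the largest-scale group and the small channels in the smallest-scale group, while all accumulation stays in integers until the final multiplication.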

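Tender is an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. By analyzing outlier values in LLMs, it proposes a decomposed quantization technique in which the scale factors of the decomposed matrices are powers of two apart. The scheme avoids explicit requantization and achieves higher accuracy and inference performance while being less intrusive to existing accelerators.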

Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization