The era of large language models (LLMs) demands faster and less costly inference. Prior
work on LLM compression tends to take a software-centric approach
focused primarily on simulated quantization performance. Because they neglect the
feasibility of deployment, these approaches are often unusable in
practice. They either push the quantization bit width drastically lower to
reduce computation, to a degree that mainstream hardware may not support, or
rely on sophisticated algorithms that introduce extra computation or memory
access overhead. We argue that pursuing a hardware-centric approach in the
construction of quantization algorithms is crucial. We therefore
build our compression method on hardware awareness,
eliminating impractical algorithm choices while maximizing the benefit of
hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8
(4-bit weights, 8-bit activations) kernel
implementation called FastGEMM and a combined recipe of quantization
strategies. Extensive experiments show that our W4A8 method
delivers actual speedups of up to \textbf{4$\times$} over
Hugging Face FP16 inference, \textbf{2.23$\times$} over the state-of-the-art
inference engine TensorRT-LLM in FP16, and \textbf{1.45$\times$} over
TensorRT-LLM in INT8, without substantially harming model performance.

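Taking a hardware-centric approach, our compression method builds on hardware acceleration with a new W4A8 kernel implementation and a combined recipe of quantization strategies. Extensive experiments show that our W4A8 method achieves an actual speedup of 4$\times$ over Hugging Face FP16 inference, 2.23$\times$ over the TensorRT-LLM inference engine in FP16, and 1.45$\times$ over TensorRT-LLM in INT8, without substantially harming performance.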