docs/features/quantization/llm_compressor.md
LLM Compressor is a library for optimizing models for deployment with vLLM. It provides a comprehensive set of quantization algorithms, including FP4, FP8, INT8, and INT4 quantization.
Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options. Quantization reduces the precision of model weights and activations to smaller data types, lowering memory requirements while preserving inference output quality.
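To make the precision-reduction idea concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization in plain Python. The function names are illustrative only and are not LLM Compressor's API; the library applies far more sophisticated algorithms and calibration than this toy example.

```python
# Illustrative sketch of symmetric INT8 quantization (not LLM Compressor's API).

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

# Each weight now fits in one byte instead of two or four, and each
# recovered value is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

The same trade-off applies at model scale: smaller data types cut memory roughly in half (FP16 to INT8) or by three quarters (FP16 to INT4), at the cost of a bounded rounding error per value.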
LLM Compressor's key benefit is that it handles the complexity of quantization, calibration, and format conversion for you, producing models ready for immediate use with vLLM.