docs/features/quantization/llm_compressor.md
LLM Compressor is a library for optimizing models for deployment with vLLM. It provides a comprehensive set of quantization algorithms, including FP4, FP8, INT8, and INT4 quantization.
Modern LLMs often contain billions of parameters stored in 16-bit or 32-bit floating point, requiring substantial GPU memory and limiting deployment options. Quantization reduces the precision of model weights and activations to smaller data types, lowering memory requirements while preserving inference output quality.
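To make the precision-reduction idea concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization in plain Python. The function names are illustrative only and are not LLM Compressor's API; the library applies far more sophisticated algorithms and calibration than this toy example.

```python
# Illustrative sketch of symmetric INT8 quantization (not LLM Compressor's API).

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

# Each weight now fits in one byte instead of two or four, and each
# recovered value is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

The same trade-off applies at model scale: smaller data types cut memory roughly in half (FP16 to INT8) or by three quarters (FP16 to INT4), at the cost of a bounded rounding error per value.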
LLM Compressor's key benefit is that it handles the complexity of quantization, calibration, and format conversion for you, producing models ready for immediate use with vLLM.