+++
disableToc = false
title = "Model Quantization"
weight = 19
url = '/features/quantization/'
+++
LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them to smaller sizes for efficient inference with llama.cpp.
{{% notice note %}} This feature is experimental and may change in future releases. {{% /notice %}}
| Backend | Description | Quantization Types | Platforms |
|---|---|---|---|
| llama-cpp-quantization | Downloads HF models, converts to GGUF, and quantizes using llama.cpp | q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_k_s, q4_k_m, q5_0, q5_k_s, q5_k_m, q6_k, q8_0, f16 | CPU (Linux, macOS) |
To start a quantization job:

```bash
curl -X POST http://localhost:8080/api/quantization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/functiongemma-270m-it",
    "quantization_type": "q4_k_m"
  }'
```

To follow progress as a Server-Sent Events stream:

```bash
curl -N http://localhost:8080/api/quantization/jobs/{job_id}/progress
```

To download the quantized GGUF file once the job completes:

```bash
curl -o model.gguf http://localhost:8080/api/quantization/jobs/{job_id}/download
```

To import the quantized model directly into LocalAI instead:

```bash
curl -X POST http://localhost:8080/api/quantization/jobs/{job_id}/import \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-quantized-model"
  }'
```
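The same workflow can be driven from code. Below is a minimal Python sketch using only the standard library, assuming a local server on port 8080 and that the job-creation response includes a `job_id` field (matching the progress payload documented below); the `quantize` function name and 5-second polling interval are illustrative:

```python
import json
import time
import urllib.request

BASE = "http://localhost:8080/api/quantization"

def job_url(job_id, action=""):
    """Build the URL for a job endpoint, e.g. job_url(jid, "download")."""
    url = f"{BASE}/jobs/{job_id}"
    return f"{url}/{action}" if action else url

def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def quantize(model, quantization_type="q4_k_m"):
    """Start a job, poll until a terminal status, then fetch the GGUF file."""
    job = post_json(f"{BASE}/jobs", {
        "model": model,
        "quantization_type": quantization_type,
    })
    job_id = job["job_id"]  # assumed field name; matches the progress payload
    while True:
        with urllib.request.urlopen(job_url(job_id)) as resp:
            status = json.load(resp)["status"]
        if status in ("completed", "failed", "stopped"):
            break
        time.sleep(5)
    if status == "completed":
        urllib.request.urlretrieve(job_url(job_id, "download"), "model.gguf")
    return status
```

Polling job details here is the simple option; the SSE progress endpoint described below avoids the polling loop entirely.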
| Method | Path | Description |
|---|---|---|
| POST | `/api/quantization/jobs` | Start a quantization job |
| GET | `/api/quantization/jobs` | List all jobs |
| GET | `/api/quantization/jobs/:id` | Get job details |
| POST | `/api/quantization/jobs/:id/stop` | Stop a running job |
| DELETE | `/api/quantization/jobs/:id` | Delete a job and its data |
| GET | `/api/quantization/jobs/:id/progress` | SSE progress stream |
| POST | `/api/quantization/jobs/:id/import` | Import quantized model into LocalAI |
| GET | `/api/quantization/jobs/:id/download` | Download quantized GGUF file |
| GET | `/api/quantization/backends` | List available quantization backends |
The request body for starting a job accepts the following fields:

| Field | Type | Description |
|---|---|---|
| `model` | string | HuggingFace model ID or local path (required) |
| `backend` | string | Backend name (default: `llama-cpp-quantization`) |
| `quantization_type` | string | Quantization format (default: `q4_k_m`) |
| `extra_options` | map | Backend-specific options (see below) |
Supported `extra_options` keys:

| Key | Description |
|---|---|
| `hf_token` | HuggingFace token for gated models |
The import request body accepts:

| Field | Type | Description |
|---|---|---|
| `name` | string | Model name for LocalAI (auto-generated if empty) |
A job moves through the following statuses:

| Status | Description |
|---|---|
| `queued` | Job created, waiting to start |
| `downloading` | Downloading model from HuggingFace |
| `converting` | Converting model to f16 GGUF |
| `quantizing` | Running quantization |
| `completed` | Quantization finished successfully |
| `failed` | Job failed (check `message` for details) |
| `stopped` | Job stopped by user |
The `GET /api/quantization/jobs/:id/progress` endpoint returns Server-Sent Events (SSE) with JSON payloads:

```json
{
  "job_id": "abc-123",
  "progress_percent": 65.0,
  "status": "quantizing",
  "message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",
  "output_file": "",
  "extra_metrics": {}
}
```
When the job completes, `output_file` contains the path to the quantized GGUF file and `extra_metrics` includes `file_size_mb`.
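Consuming the stream amounts to reading the response line by line and decoding each `data:` payload. A minimal sketch, assuming the payload fields shown above and standard SSE framing (`parse_sse_data` is an illustrative helper, not part of any LocalAI client):

```python
import json

def parse_sse_data(line):
    """Decode one SSE 'data:' line into a progress dict; return None otherwise."""
    if not line.startswith("data:"):
        return None  # skip comments, event-name lines, and blank keep-alives
    return json.loads(line[len("data:"):].strip())

# One event as the progress endpoint might emit it
event = parse_sse_data(
    'data: {"job_id": "abc-123", "progress_percent": 65.0, "status": "quantizing"}'
)
```

In practice you would iterate over the HTTP response body (e.g. `for raw in resp:`), decode each line, and feed it through this function, stopping once the `status` field reaches a terminal value.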
| Type | Size | Quality | Description |
|---|---|---|---|
| q2_k | Smallest | Lowest | 2-bit quantization |
| q3_k_s | Very small | Low | 3-bit small |
| q3_k_m | Very small | Low | 3-bit medium |
| q3_k_l | Small | Low-medium | 3-bit large |
| q4_0 | Small | Medium | 4-bit legacy |
| q4_k_s | Small | Medium | 4-bit small |
| q4_k_m | Small | Good | 4-bit medium (recommended) |
| q5_0 | Medium | Good | 5-bit legacy |
| q5_k_s | Medium | Good | 5-bit small |
| q5_k_m | Medium | Very good | 5-bit medium |
| q6_k | Large | Excellent | 6-bit |
| q8_0 | Large | Near-lossless | 8-bit |
| f16 | Largest | Lossless | 16-bit (no quantization, GGUF conversion only) |
The UI also supports entering a custom quantization type string for any format supported by `llama-quantize`.
A "Quantize" page appears in the sidebar under the Tools section, exposing the same workflow through the Web UI.
Quantization uses the same gRPC backend architecture as fine-tuning:

- gRPC messages: `QuantizationRequest`, `QuantizationProgress` (streaming), `StopQuantization`
- The backend wraps llama.cpp's `convert_hf_to_gguf.py` and `llama-quantize`