skills/mlops/inference/llama-cpp/references/quantization.md
Complete guide to GGUF quantization formats and model conversion.
Before using generic tables, open the model repo with:
https://huggingface.co/<repo>?local-app=llama.cpp
Prefer the exact quant labels and sizes shown in the Hardware compatibility section of the fetched ?local-app=llama.cpp page text or HTML. Then confirm the matching filenames in:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
Use the Hub page first, and only fall back to the generic heuristics below when the repo page does not expose a clear recommendation.
GGUF (GPT-Generated Unified Format) - Standard format for llama.cpp models.
| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|---|---|---|---|---|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| Q6_K | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| Q5_K_M | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| Q4_K_M | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | Recommended |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
Recommendation: Use Q4_K_M for best balance of quality and speed.
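The size column can be turned into a rough "effective bits per weight" figure, which makes the formats easier to compare across model sizes. The sketch below is only an estimate: it assumes the sizes are decimal GB (1e9 bytes) and that "7B" means exactly 7e9 parameters, neither of which is stated in the table. The function name `bits_per_weight` is hypothetical.

```shell
# Effective bits per weight = file size in bits / parameter count.
# Assumptions (not from the table): 1 GB = 1e9 bytes, "7B" = 7e9 parameters.
bits_per_weight() {
  # $1 = file size in GB, $2 = parameter count in billions
  awk -v gb="$1" -v params_b="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params_b }'
}

bits_per_weight 4.1 7   # Q4_K_M row, prints 4.69
bits_per_weight 7.0 7   # Q8_0 row, prints 8.00
```

Because the parameter count and byte units are approximations, treat the result as a relative comparison between rows rather than an exact property of the format.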
# 1. Download Hugging Face model
hf download meta-llama/Llama-2-7b-chat-hf \
  --local-dir models/llama-2-7b-chat/
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
  models/llama-2-7b-chat/ \
  --outtype f16 \
  --outfile models/llama-2-7b-chat-f16.gguf
# 3. Quantize to Q4_K_M
./llama-quantize \
  models/llama-2-7b-chat-f16.gguf \
  models/llama-2-7b-chat-Q4_K_M.gguf \
  Q4_K_M
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-quantize \
    model-f16.gguf \
    "model-${quant}.gguf" \
    "$quant"
done
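After a conversion or quantization run, a quick sanity check is to confirm that each output file really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF`, so a minimal sketch can test just that header. The helper name `is_gguf` is hypothetical, and this check says nothing about whether the tensors inside are valid.

```shell
# Succeed if the file exists and starts with the 4-byte ASCII magic "GGUF".
# This only inspects the header; it does not validate the model contents.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Hypothetical usage after the quantize loop:
# for f in model-*.gguf; do
#   if is_gguf "$f"; then echo "ok: $f"; else echo "not a GGUF file: $f"; fi
# done
```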
K-quants use mixed precision for better quality:
Variants:
- _S (Small): Faster, lower quality
- _M (Medium): Balanced (recommended)
- _L (Large): Better quality, larger size

Example: Q4_K_M

- Q4: 4-bit quantization
- K: Mixed precision method
- M: Medium quality

# Calculate perplexity (quality metric)
./llama-perplexity \
  -m model.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 512
# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
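The percentages quoted above are relative increases over the FP16 baseline, computed as (quantized - baseline) / baseline. A small helper makes it easy to apply the same comparison to your own perplexity runs. The function name `ppl_delta` is hypothetical, and the inputs are the final perplexity estimates you read off the tool's output yourself.

```shell
# Relative perplexity change versus a baseline, as a signed percentage.
ppl_delta() {
  # $1 = baseline perplexity, $2 = quantized perplexity
  awk -v base="$1" -v quant="$2" \
    'BEGIN { printf "%+.2f%%\n", (quant - base) / base * 100 }'
}

ppl_delta 5.9565 6.0565   # FP16 vs Q4_K_M, prints +1.68%
ppl_delta 5.9565 6.8673   # FP16 vs Q2_K, prints +15.29%
```

Comparisons are only meaningful when both runs use the same test file and the same context size (`-c`).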
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
Q5_K_M or Q6_K - Higher precision helps with code
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
Q6_K or Q8_0 - Maximum accuracy
Q2_K or Q3_K_S - Fit in limited RAM
| Format | Size (7B) | RAM needed |
|---|---|---|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |
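The table above can be read as a simple lookup: pick the highest-quality format whose "RAM needed" value fits in the memory you have. The sketch below hard-codes the rows of that table, which appear to describe a 7B model since the sizes match the 7B column earlier in this guide. The function name `pick_quant_7b` is hypothetical, and the thresholds are the table's estimates, not measured values.

```shell
# Print the largest quant from the table whose "RAM needed" fits in $1 GB.
# Returns non-zero if even Q2_K does not fit.
pick_quant_7b() {
  ram_gb="$1"   # available RAM in whole GB
  # "format:RAM needed" pairs copied from the table, largest first.
  for entry in Q8_0:11 Q6_K:9 Q5_K_M:8 Q4_K_M:7 Q3_K_M:6 Q2_K:5; do
    quant="${entry%%:*}"
    need="${entry##*:}"
    if [ "$ram_gb" -ge "$need" ]; then
      echo "$quant"
      return 0
    fi
  done
  return 1
}

pick_quant_7b 8   # prints Q5_K_M
```

Longer contexts need more memory for the KV cache, so leave some headroom rather than picking a format that fits exactly.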
| Format | Size (13B) | RAM needed |
|---|---|---|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |
| Format | Size (70B) | RAM needed |
|---|---|---|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q5_K_M | 48 GB | 56 GB |
Recommendation for 70B: Use Q3_K_M or Q4_K_S to fit on consumer hardware.
Use the Hub search with the llama.cpp app filter:
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
For a specific repo, open:
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
Then launch directly from the Hub without extra Hub tooling:
llama-cli -hf <repo>:Q4_K_M
llama-server -hf <repo>:Q4_K_M
If you need the exact file name from the tree API:
llama-server --hf-repo <repo> --hf-file <filename.gguf>
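To find the exact filename for `--hf-file`, the tree API response can be filtered for `.gguf` paths. The sketch below assumes the response is a JSON array of objects that each carry a `"path"` field, and it uses `grep` and `sed` instead of a real JSON parser, so it will miss paths containing escaped quotes. The helper name `gguf_paths` is hypothetical; a tool such as `jq` would be more robust if it is available.

```shell
# Read tree API JSON on stdin and print every path ending in .gguf.
# Optional $1 = quant label to filter on (case-insensitive substring match).
gguf_paths() {
  quant="${1:-}"
  grep -oE '"path"[[:space:]]*:[[:space:]]*"[^"]*\.gguf"' \
    | sed -E 's/^"path"[[:space:]]*:[[:space:]]*"//; s/"$//' \
    | grep -iF -- "$quant"
}

# Hypothetical usage (replace <repo> with the real repo id):
# curl -s "https://huggingface.co/api/models/<repo>/tree/main?recursive=true" \
#   | gguf_paths Q4_K_M
```

The printed path is what you would pass as `<filename.gguf>` in the `--hf-file` command above.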
What: Calibration data to improve quantization quality.
Benefits:
Usage:
# 1. Generate importance matrix
./llama-imatrix \
  -m model-f16.gguf \
  -f calibration-data.txt \
  -o model.imatrix
# 2. Quantize with imatrix
./llama-quantize \
  --imatrix model.imatrix \
  model-f16.gguf \
  model-Q4_K_M.gguf \
  Q4_K_M
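When scripting several quantization runs, it can help to add `--imatrix` only when the importance matrix file actually exists, and to fall back to a plain quantize otherwise. The sketch below only prints the command it would run, so it can be reviewed first. The helper name `quantize_cmd` is hypothetical, and it assumes paths without spaces.

```shell
# Print a llama-quantize command, adding --imatrix only if the file exists.
# Assumes paths contain no spaces; the command is printed, not executed.
quantize_cmd() {
  # $1 = input GGUF, $2 = output GGUF, $3 = quant type, $4 = optional imatrix
  if [ -n "${4:-}" ] && [ -f "$4" ]; then
    echo "./llama-quantize --imatrix $4 $1 $2 $3"
  else
    if [ -n "${4:-}" ]; then
      echo "warning: imatrix file not found: $4" >&2
    fi
    echo "./llama-quantize $1 $2 $3"
  fi
}

# Hypothetical usage:
# quantize_cmd model-f16.gguf model-Q4_K_M.gguf Q4_K_M model.imatrix
```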
Calibration data:
Model outputs gibberish:
Out of memory:
- Offload fewer layers to the GPU (-ngl)
- Reduce the context size (-c 2048)

Slow inference: