GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

Hub-first quant selection

Before using generic tables, open the model repo with:

text
https://huggingface.co/<repo>?local-app=llama.cpp

Prefer the exact quant labels and sizes shown in the Hardware compatibility section of the fetched ?local-app=llama.cpp page text or HTML. Then confirm the matching filenames in:

text
https://huggingface.co/api/models/<repo>/tree/main?recursive=true

Use the Hub page first, and only fall back to the generic heuristics below when the repo page does not expose a clear recommendation.
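
The filename check can also be done from a terminal. The sketch below assumes the tree API returns a JSON array whose entries carry a `path` field; `gguf_paths` and `list_repo_gguf` are hypothetical helper names, not part of llama.cpp or any Hub tooling. It uses `grep` and `sed` rather than a JSON parser, so treat it as a quick check rather than a robust client.

```shell
# Hypothetical helpers (not part of llama.cpp or the Hub tooling).

# Read tree-API JSON on stdin and print one .gguf path per line.
# Assumes each entry has a "path" field, e.g. {"type":"file","path":"x.gguf"}.
gguf_paths() {
    grep -oE '"path"[[:space:]]*:[[:space:]]*"[^"]*\.gguf"' |
        sed -E 's/.*"([^"]*)"$/\1/'
}

# Fetch the file tree for a repo (e.g. "org/model") and list its GGUF files.
list_repo_gguf() {
    curl -fsSL "https://huggingface.co/api/models/$1/tree/main?recursive=true" |
        gguf_paths
}
```

Calling `list_repo_gguf <repo>` should print the GGUF filenames to compare against the quant labels shown on the Hub page.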

Quantization Overview

GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp.

Format Comparison

| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| Q6_K | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| Q5_K_M | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| Q4_K_M | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | Recommended |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

Recommendation: Use Q4_K_M for best balance of quality and speed.

Converting Models

Hugging Face to GGUF

bash
# 1. Download Hugging Face model
hf download meta-llama/Llama-2-7b-chat-hf \
    --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
    models/llama-2-7b-chat/ \
    --outtype f16 \
    --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
    models/llama-2-7b-chat-f16.gguf \
    models/llama-2-7b-chat-Q4_K_M.gguf \
    Q4_K_M
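
A quick sanity check after conversion or quantization is to confirm that the output starts with the four-byte GGUF magic (`GGUF`). The sketch below uses a hypothetical helper name, `is_gguf`; it only inspects the file header and says nothing about whether the tensors inside are valid.

```shell
# Hypothetical helper: succeed only if the file exists and starts with the
# 4-byte GGUF magic ("GGUF"). This checks the header, not the contents.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf models/llama-2-7b-chat-Q4_K_M.gguf || echo "not a GGUF file" >&2` flags a truncated or mis-named output before it is loaded.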

Batch quantization

bash
# Quantize one FP16 GGUF to several formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./llama-quantize \
        model-f16.gguf \
        "model-${quant}.gguf" \
        "$quant"
done

K-Quantization Methods

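K-quants use mixed precision within a model for better quality at a given size:

  • Tensors that are more sensitive to quantization error (for example, some attention and feed-forward projections): higher precision
  • Remaining weights: the nominal bit width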

Variants:

  • _S (Small): Faster, lower quality
  • _M (Medium): Balanced (recommended)
  • _L (Large): Better quality, larger size

Example: Q4_K_M

  • Q4: 4-bit quantization
  • K: K-quant (mixed-precision) method
  • M: Medium variant
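
The naming scheme can be decoded mechanically. The sketch below is a hypothetical helper, `describe_quant`, that splits a label into its bit width, family, and variant. It only understands the `Q<bits>_...` labels used in this guide, and the family name `non-k` is an illustrative label chosen here for formats such as Q8_0.

```shell
# Hypothetical helper: describe a quant label such as Q4_K_M, Q6_K, or Q8_0.
# Runs in a subshell (parentheses) so its variables do not leak to the caller.
describe_quant() (
    label=$1
    rest=${label#Q}      # "Q4_K_M" -> "4_K_M"
    bits=${rest%%_*}     # "4_K_M"  -> "4"
    case $label in
        *_K_S) family=k-quant; variant=small ;;
        *_K_M) family=k-quant; variant=medium ;;
        *_K_L) family=k-quant; variant=large ;;
        *_K)   family=k-quant; variant=none ;;
        *)     family=non-k;   variant=n/a ;;
    esac
    printf 'bits=%s family=%s variant=%s\n' "$bits" "$family" "$variant"
)
```

Running `describe_quant Q4_K_M` prints `bits=4 family=k-quant variant=medium`.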

Quality Testing

bash
# Calculate perplexity (quality metric)
./llama-perplexity \
    -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
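
The percentages quoted in this guide are relative increases over the FP16 baseline, that is, (quantized - baseline) / baseline. A minimal sketch of that arithmetic follows; `ppl_delta` is a hypothetical helper name, and the two perplexity values are whatever `llama-perplexity` reported for each model.

```shell
# Hypothetical helper: print the relative perplexity change of a quantized
# model versus its baseline, as a signed percentage.
# Usage: ppl_delta <baseline_ppl> <quantized_ppl>
ppl_delta() {
    awk -v base="$1" -v quant="$2" \
        'BEGIN { printf "%+.2f%%\n", (quant - base) / base * 100 }'
}
```

With the figures from this guide, `ppl_delta 5.9565 6.0565` prints `+1.68%`.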

Use Case Guide

General purpose (chatbots, assistants)

Q4_K_M - Best balance
Q5_K_M - If you have extra RAM

Code generation

Q5_K_M or Q6_K - Higher precision helps with code

Creative writing

Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation

Technical/medical

Q6_K or Q8_0 - Maximum accuracy

Edge devices (Raspberry Pi)

Q2_K or Q3_K_S - Fit in limited RAM

Model Size Scaling

7B parameter models

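| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |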

13B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |

70B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q5_K_M | 48 GB | 56 GB |

Recommendation for 70B: Use Q3_K_M or Q4_K_S to fit on high-end consumer hardware (about 40-46 GB of RAM per the table above).
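
The size tables can be turned into a simple lookup. The sketch below hard-codes the "RAM needed" column of the 7B table and prints the largest format that fits a given budget. `largest_fit_7b` is a hypothetical helper name, and the thresholds are this guide's figures, not measurements of any particular machine.

```shell
# Hypothetical helper: given available RAM in whole GB, print the largest
# 7B format from this guide's table whose "RAM needed" value fits.
largest_fit_7b() {
    ram_gb=$1
    # Rows are "format:ram_needed_gb", ordered from largest to smallest.
    for row in Q8_0:11 Q6_K:9 Q5_K_M:8 Q4_K_M:7 Q3_K_M:6 Q2_K:5; do
        if [ "$ram_gb" -ge "${row#*:}" ]; then
            echo "${row%%:*}"
            return 0
        fi
    done
    echo "no 7B format in the table fits in ${ram_gb} GB" >&2
    return 1
}
```

For example, `largest_fit_7b 8` prints `Q5_K_M`, while a 4 GB budget returns a non-zero status because even Q2_K is listed as needing 5 GB.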

Finding Pre-Quantized Models

Use the Hub search with the llama.cpp app filter:

text
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending

For a specific repo, open:

text
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true

Then launch directly from the Hub without extra Hub tooling:

bash
llama-cli -hf <repo>:Q4_K_M
llama-server -hf <repo>:Q4_K_M

If you need the exact file name from the tree API:

bash
llama-server --hf-repo <repo> --hf-file <filename.gguf>
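
When a repo holds many files, picking the one that matches a quant label can be scripted. The sketch below assumes a list of filenames, one per line, is already available (for instance, copied from the tree API output). `pick_gguf_file` is a hypothetical helper name.

```shell
# Hypothetical helper: read filenames on stdin (one per line) and print the
# first .gguf file whose name contains the given quant label, ignoring case.
# Note: a short label such as Q4_K also matches Q4_K_M and Q4_K_S.
pick_gguf_file() {
    grep -iF -- "$1" | grep -i '\.gguf$' | head -n 1
}
```

The printed name can then be passed as the `--hf-file` value in the command above.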

Importance Matrices (imatrix)

What: Activation statistics collected by running the model over calibration text, used to decide which weights matter most during quantization.

Benefits:

  • 10-20% perplexity improvement with Q4
  • Essential for Q3 and below

Usage:

bash
# 1. Generate importance matrix
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
    --imatrix model.imatrix \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M

Calibration data:

  • Use domain-specific text (e.g., code for code models)
  • ~100MB of representative text
  • Higher quality data = better quantization
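
One way to assemble the calibration file is to concatenate a directory of representative text. The sketch below is an assumption about workflow, not something llama.cpp prescribes: `build_calibration` is a hypothetical helper, and it expects plain `.txt` files under a source directory.

```shell
# Hypothetical helper: concatenate every .txt file under a directory into one
# calibration file for llama-imatrix, then print the result's size in bytes.
# Write the output OUTSIDE the source directory so it is not read as input.
build_calibration() {
    src_dir=$1
    out_file=$2
    find "$src_dir" -type f -name '*.txt' | sort | while IFS= read -r f; do
        cat "$f"
        printf '\n'   # keep documents separated
    done > "$out_file"
    wc -c < "$out_file" | tr -d ' '
}
```

For example, `build_calibration corpus/ calibration-data.txt` produces the file passed to `llama-imatrix -f` in the usage block above.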

Troubleshooting

Model outputs gibberish:

  • Quantization too aggressive (Q2_K)
  • Try Q4_K_M or Q5_K_M
  • Verify model converted correctly

Out of memory:

  • Use a lower-bit quantization (for example, Q4_K_S instead of Q5_K_M)
  • Offload fewer layers to the GPU (lower -ngl) if GPU memory is the limit
  • Use a smaller context (-c 2048)
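
Stepping down one format at a time keeps the quality loss as small as possible. The sketch below encodes the size ordering from the format comparison table in this guide. `next_smaller_quant` is a hypothetical helper name, and the ladder covers only the formats listed in that table.

```shell
# Hypothetical helper: print the next smaller format after the given one,
# following the size ordering of this guide's comparison table.
next_smaller_quant() {
    case $1 in
        Q8_0)   echo Q6_K ;;
        Q6_K)   echo Q5_K_M ;;
        Q5_K_M) echo Q4_K_M ;;
        Q4_K_M) echo Q4_K_S ;;
        Q4_K_S) echo Q3_K_M ;;
        Q3_K_M) echo Q2_K ;;
        *)      echo "no smaller format listed for: $1" >&2; return 1 ;;
    esac
}
```

For example, `next_smaller_quant Q5_K_M` prints `Q4_K_M`. The helper returns a non-zero status for Q2_K, the smallest format in the table.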

Slow inference:

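  • Higher-bit formats read more data per token, so they typically generate more slowly
  • Q8_0 is much slower than Q4_K_M (25 vs 40 tok/s in the format comparison table above)
  • Weigh speed against quality when choosing a format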