docs/src/content/docs/guides/quantization/online-calibration.mdx
import { Tabs, TabItem } from '@astrojs/starlight/components';
Online calibration observes the activations of a live model quantized with ISQ (in-situ quantization), then requantizes every layer from the original weights using an importance matrix derived from that traffic. The layers are hot-swapped in place with no restart: the model serves normally while collecting, and requests received during the apply step queue until it finishes.
Quantized this way, a model is measurably closer to its full-precision outputs on the distribution it actually serves, at the same bit width and speed.
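To make the term "importance matrix" concrete, here is a minimal Python sketch. It assumes the common imatrix convention, in which each input channel's importance is the mean of its squared activations and quantization error is weighted by that importance. The function names are hypothetical, and this is not the implementation used by the server, which this guide does not describe.

```python
def accumulate_importance(rows):
    """Per-input-channel mean of squared activations over observed token rows.

    Assumption: importance follows the usual imatrix convention; the guide
    itself does not specify the exact statistic.
    """
    n_cols = len(rows[0])
    sums = [0.0] * n_cols
    for row in rows:
        for j, x in enumerate(row):
            sums[j] += x * x
    return [s / len(rows) for s in sums]


def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared error between original and dequantized weights."""
    return sum(
        imp * (w - q) ** 2
        for w, q, imp in zip(weights, dequantized, importance)
    )


# Two token rows over two input channels; channel 0 carries larger activations.
importance = accumulate_importance([[1.0, 0.0], [3.0, 2.0]])

# Two candidate roundings of the same weights: each is exact on one channel only.
err_exact_on_quiet_channel = weighted_error([0.5, 0.5], [0.0, 0.5], importance)
err_exact_on_busy_channel = weighted_error([0.5, 0.5], [0.5, 0.0], importance)
```

Under this weighting, the candidate that keeps the busier channel exact has the lower error, which is the sense in which requantizing with traffic statistics favors the channels that live requests actually exercise.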
Serve any model with ISQ:
```bash
mistralrs serve -m <model> --isq q4k
```
Then drive the lifecycle on the surface of your choice. There is no CLI command for the lifecycle itself; it is driven over HTTP or from an SDK against the running server.
<Tabs>
<TabItem label="HTTP">

```bash
# begin observing live traffic; collection adds some decode overhead while on, and on CUDA,
# MoE (Mixture of Experts) models additionally run their reference expert path during collection
curl -X POST localhost:1234/calibration/start

# check per-layer collection progress
curl localhost:1234/calibration/status

# requantize from the source weights with the collected statistics and hot-swap
curl -X POST localhost:1234/calibration/apply \
  -H "Content-Type: application/json" \
  -d '{"save_cimatrix": "traffic.cimatrix"}'
```

`status` reports how many layers are collecting and the token rows seen per layer. `apply` harvests the statistics, requantizes, and returns the pre-apply status. The optional `save_cimatrix` writes the collected importance matrix for reuse with `--imatrix`.

</TabItem>
<TabItem label="Rust">

The same lifecycle is exposed on `Model`:

```rust
model.begin_calibration().await?;
// ... serve traffic ...
let status = model.calibration_status().await?;
model.apply_calibration(Some("traffic.cimatrix".into())).await?;
```

Each method has a `_with_model` variant for multi-model setups. See the full example.

</TabItem>
<TabItem label="Python">

```python
runner.begin_calibration()
# ... serve traffic ...
status = runner.calibration_status()
runner.apply_calibration(save_cimatrix="traffic.cimatrix")
```

`calibration_status` returns a `CalibrationStatus` with `collecting`, `layers`, `layers_tracking`, `total_rows`, `min_rows`, and `max_rows` fields. See the full example.

</TabItem>
</Tabs>
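The lifecycle above can also be automated over HTTP. The following is a minimal Python sketch using only the standard library. It assumes the `/calibration/status` response is a JSON object carrying the same `collecting` and `min_rows` fields as the Python `CalibrationStatus`, which this guide does not confirm. The helper names and the `min_rows_target` threshold are hypothetical; choose a threshold that suits your traffic.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:1234"  # same host and port as the curl examples


def ready_to_apply(status, min_rows_target):
    """True once collection is on and the least-observed layer has enough rows.

    Assumption: `status` has `collecting` and `min_rows` keys.
    """
    return bool(status.get("collecting")) and status.get("min_rows", 0) >= min_rows_target


def get_status():
    """GET /calibration/status and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/calibration/status") as resp:
        return json.load(resp)


def apply(save_cimatrix):
    """POST /calibration/apply, saving the importance matrix to `save_cimatrix`."""
    req = urllib.request.Request(
        f"{BASE_URL}/calibration/apply",
        data=json.dumps({"save_cimatrix": save_cimatrix}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_then_apply(min_rows_target=4096, poll_seconds=30.0,
                    fetch=get_status, do_apply=apply):
    """Poll status until every layer has enough rows, then apply.

    `fetch` and `do_apply` are injectable so the loop can be exercised
    without a running server.
    """
    while not ready_to_apply(fetch(), min_rows_target):
        time.sleep(poll_seconds)
    return do_apply("traffic.cimatrix")
```

Gating on `min_rows` rather than `total_rows` is a deliberate choice in this sketch: it waits for the layer that has seen the least traffic, so no layer is requantized from a handful of rows.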
Collection costs nothing until started, and decode returns to full speed after apply.
Calibration requires a model loaded with `--isq` from source weights (safetensors); `start` errors otherwise (including for models loaded with `--from-uqff`). GGUF-family types (such as `Q2K`-`Q6K`) and AFQ types collect and requantize; HQQ and FP8 ISQ types do not support collection, so `start` errors for them.
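These constraints can be checked client-side before calling `start`. The sketch below is a hypothetical helper, not part of any SDK. It assumes that HQQ and FP8 ISQ type names begin with `HQQ` and `FP8` respectively, and the type names in the usage lines are illustrative only.

```python
# Assumption: unsupported ISQ families are identifiable by name prefix.
UNSUPPORTED_PREFIXES = ("HQQ", "FP8")


def can_start_calibration(isq_type, loaded_from_uqff=False):
    """Mirror the documented rules: source weights only, no HQQ or FP8 types."""
    if loaded_from_uqff:
        # Models loaded from UQFF have no source weights to requantize from.
        return False
    return not isq_type.upper().startswith(UNSUPPORTED_PREFIXES)


k_quant_ok = can_start_calibration("q4k")
hqq_ok = can_start_calibration("HQQ4")
uqff_ok = can_start_calibration("q4k", loaded_from_uqff=True)
```

The server remains the source of truth: `start` still returns an error for unsupported configurations, so a check like this only saves a round trip.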