docs/TOPOLOGY.md
> [!NOTE]
> Manual device mapping flags are deprecated in favor of automatic placement because they are easy to misconfigure. Topology files remain the preferred way to express per-layer quantization, and you can still provide `device` overrides here when you truly need to. Those overrides win over the automatic mapper, so apply them sparingly. See the device mapping documentation for guidance.
Use a simple model topology to configure ISQ and device mapping per layer with a single YAML file (see the `topologies` directory for examples).

To support a per-layer mix of ISQ types, Mistral.rs can load a model topology YAML file. This YAML file is formatted as follows:
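- Each top-level key is a range of layers (`start-end`) where `start < end`. `start` is inclusive and `end` is exclusive.
- Each range maps to its topology:
    - An optional key (`isq`) which maps to a single value, which can be any ISQ type. If not specified, no ISQ is applied to this range of layers.
    - An optional key (`device`) which maps to a single value, which is one of the below. If not specified, the default loading device will be used.
        - `cpu`
        - `cuda[ORDINAL]`
        - `metal[ORDINAL]`

Note that:

- Layers that are not covered by the topology inherit any other ISQ setting (e.g. with `--isq`/`in_situ_quant`).
- For layers that are covered, the topology value overrides any other ISQ setting (e.g. with `--isq`/`in_situ_quant`).

When loading a UQFF model, the quantization is already applied during UQFF creation. Therefore:

- Any `isq` settings in the topology are ignored.
- Only the device mapping is used.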
This is useful for deploying pre-quantized models across multiple devices without re-quantizing.
Example topology for UQFF device mapping:
# Only device mapping is used; isq would be ignored
0-16:
  device: cuda[0]
16-32:
  device: cuda[1]
See the UQFF documentation for complete examples.
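As a rough illustration of the UQFF behavior described above, the following Python sketch expands half-open `start-end` keys into a per-layer device list and drops any `isq` entries. It is not mistral.rs code: the `parse_range` and `expand_devices` helpers and the plain-dict form of the topology are assumptions made for this example.

```python
def parse_range(key: str) -> range:
    """Turn a 'start-end' key into a half-open range (start inclusive, end exclusive)."""
    start, end = (int(part) for part in key.split("-"))
    if start >= end:
        raise ValueError(f"invalid layer range {key!r}: start must be < end")
    return range(start, end)


def expand_devices(topology: dict, num_layers: int) -> list:
    """Hypothetical helper: per-layer device names, None where no device is set.

    Any 'isq' entries are deliberately ignored, mirroring the UQFF case.
    """
    devices = [None] * num_layers
    for key, settings in topology.items():
        device = settings.get("device")
        for layer in parse_range(key):
            if layer < num_layers:
                devices[layer] = device
    return devices


# Plain-dict stand-in for the example topology above (assumed representation).
uqff_topology = {
    "0-16": {"device": "cuda[0]", "isq": "Q4K"},  # isq is ignored in this sketch
    "16-32": {"device": "cuda[1]"},
}

devices = expand_devices(uqff_topology, num_layers=32)
print(devices[15], devices[16])  # cuda[0] cuda[1]
```

Because `end` is exclusive, layer 15 is the last layer on `cuda[0]` and layer 16 is the first on `cuda[1]`.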
Layer ranges are convenient when you know the numeric index, but you can also target weights by name. Keys wrapped in /.../ are interpreted as regular expressions that are matched against the fully qualified tensor name (for example, model.layers.3.attn.q_proj.weight). Regex selectors may override both isq and device.
'/attn\.q_proj$/':
  isq: Q4K
'/ffn_.*\.weight$/':
  isq: Q3K
Regex-based ISQ overrides are applied through the immediate ISQ system, so they quantize weights as they are loaded. Numeric layer ranges continue to be handled by the post-load topology pass. Regex selectors are evaluated top-to-bottom as they appear in the YAML file, so a selector that comes later in the file overrides earlier matches.
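The top-to-bottom rule can be sketched in a few lines of Python. This is an illustration only, not the matcher mistral.rs uses: the `resolve` helper is hypothetical, the two selectors are made up for the example, and both the unanchored `re.search` semantics and the per-key merging of settings are assumptions.

```python
import re

# Selectors in file order; each pattern is the text between the slashes.
# Hypothetical patterns chosen for this sketch.
selectors = [
    (r"\.attn\.", {"isq": "Q4K"}),
    (r"q_proj\.weight$", {"isq": "Q8_0", "device": "cpu"}),
]


def resolve(tensor_name: str) -> dict:
    """Collect settings from every matching selector; later selectors win per key."""
    result = {}
    for pattern, settings in selectors:
        if re.search(pattern, tensor_name):
            result.update(settings)  # assumed merge rule: later entries overwrite
    return result


print(resolve("model.layers.3.attn.q_proj.weight"))
print(resolve("model.layers.3.attn.k_proj.weight"))
print(resolve("model.layers.3.mlp.up_proj.weight"))
```

Here the first tensor name matches both selectors, so the later `Q8_0` entry replaces `Q4K`. The second name matches only the first selector, and the third matches neither.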
0-8:
  isq: Q3K
  device: cuda[0]
8-16:
  isq: Q4K
  device: cpu
16-24:
  isq: Q6K
# Skip 24-28
28-32:
  isq: Q8_0
  device: cuda[0]
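To make the gap at layers 24 to 27 concrete, here is a small Python sketch that looks up the ISQ type for a layer in the example above and, for uncovered layers, falls back to a global ISQ setting. The `effective_isq` helper, the dict form of the topology, and the `Q4K` global value are assumptions for illustration. They are not part of the mistral.rs API, and the fallback rule is this sketch's reading of how uncovered layers behave.

```python
from typing import Optional

# Dict form of the example topology above (assumed representation).
topology = {
    "0-8": {"isq": "Q3K", "device": "cuda[0]"},
    "8-16": {"isq": "Q4K", "device": "cpu"},
    "16-24": {"isq": "Q6K"},
    # Layers 24-27 are intentionally not covered.
    "28-32": {"isq": "Q8_0", "device": "cuda[0]"},
}


def effective_isq(layer: int, global_isq: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: ISQ type for one layer, or None for no ISQ."""
    for key, settings in topology.items():
        start, end = (int(part) for part in key.split("-"))
        if start <= layer < end:  # start inclusive, end exclusive
            return settings.get("isq")
    # Uncovered layer: assume it inherits the global setting, if any.
    return global_isq


for layer in (7, 8, 23, 24, 27, 28):
    print(layer, effective_isq(layer, global_isq="Q4K"))
```

With `global_isq="Q4K"`, layers 24 and 27 pick up the global value, while layers 23 and 28 keep `Q6K` and `Q8_0` from their ranges.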
Model topologies may be applied to all model types.
mistralrs run -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
mistralrs serve -p 1234 -m microsoft/Phi-3-mini-128k-instruct --topology topologies/isq.yml
Example here.