docs/UQFF.md
UQFF builds on our ISQ feature by allowing quantized models to be serialized and deserialized.
ISQ is a powerful feature that makes it easy to quantize models, but its key limitation has been the time required for requantization. Although the process is relatively fast thanks to parallelization and other techniques, repeated runs can make the experience slow.
Comparing UQFF to GGUF:
In contrast to GGUF, which only supports the GGUF quantizations, UQFF is designed with flexibility in mind. At its core, it extends the power and flexibility of ISQ. The ability to support multiple quantization types (more to come!) in one simple, easy-to-use file is a critical feature.
Additionally, users will no longer need to wait for GGUF support to begin using post-training quantized models. As we add new models and quantization schemes to mistral.rs, the feature set of UQFF will grow.
The following quantization formats are supported in UQFF. These can, of course, be combined arbitrarily during UQFF generation or ISQ using a model topology (see the sketch after the list below). When loading a UQFF model, only the per-layer device mapping feature of the topology applies.
- GGUF quantized:
- HQQ quantized:
- FP8:
- AFQ quantized (🔥 AFQ is fast on Metal):
- F8Q8:
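For example, a model topology can assign a different quantization to each layer range when generating a UQFF file. A minimal sketch, assuming the topology YAML format described in the topology documentation (the layer ranges and quantization types here are illustrative):

```yaml
# mixed-quant-topology.yml (illustrative)
# Quantize layers 0-16 with Q4K and layers 16-32 with Q8_0.
0-16:
  isq: Q4K
16-32:
  isq: Q8_0
```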
To load a UQFF model, specify the filename of the first (or only) UQFF shard. The file is resolved relative to the model ID, and can be loaded either locally or from Hugging Face.
For example:
- phi3.5-mini-instruct-q4k-0.uqff
- ../UQFF/phi3.5-mini-instruct-q4k-0.uqff

You can find a collection of UQFF models here, each of which includes a simple command to get started.
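For instance, if the UQFF shards have already been downloaded to a local directory, the relative-path form above can be used (a sketch; the model ID and path are illustrative and assume --from-uqff resolves relative paths locally):

```bash
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff ../UQFF/phi3.5-mini-instruct-q4k-0.uqff
```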
Note: when loading a UQFF model, any ISQ setting will be ignored.
Large models produce multiple shard files (e.g., q4k-0.uqff, q4k-1.uqff, q4k-2.uqff). You only need to specify one shard file -- the remaining shards are auto-discovered from the same directory or Hugging Face repository.
For example, if a model has shards q4k-0.uqff, q4k-1.uqff, and q4k-2.uqff:
```bash
# Just specify the first shard -- the rest are found automatically
mistralrs run -m EricB/MyModel-UQFF --from-uqff q4k-0.uqff
```
This also works when multiple quantizations exist in the same repo (e.g., q4k-* and q8_0-*). Only the shards matching the specified prefix are loaded.
```bash
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3-0.uqff
```
Check out the following examples:
Modify the Which instantiation as follows:
```diff
Which.Plain(
    model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+   from_uqff="phi3.5-mini-instruct-q4k-0.uqff",
),
```
When loading a UQFF model, the quantization is already baked in, so ISQ settings in the topology are ignored. However, device mapping from a topology file still applies. This is useful for splitting a pre-quantized model across multiple GPUs or offloading layers to CPU.
CLI example:
```bash
mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-q4k.uqff --topology device_map.yml
```
Topology file for device mapping only (device_map.yml):
```yaml
0-16:
  device: cuda[0]
16-32:
  device: cuda[1]
```
Rust SDK example:
```rust
use mistralrs::{UqffTextModelBuilder, Topology, LayerTopology, Device};

let model = UqffTextModelBuilder::new(
    "EricB/Phi-3.5-mini-instruct-UQFF",
    vec!["phi3.5-mini-instruct-q4k.uqff".into()],
)
.into_inner()
.with_topology(
    Topology::empty()
        .with_range(0..16, LayerTopology { isq: None, device: Some(Device::Cuda(0)) })
        .with_range(16..32, LayerTopology { isq: None, device: Some(Device::Cuda(1)) }),
)
.build()
.await?;
```
Python SDK example:
```python
runner = Runner(
    which=Which.Plain(
        model_id="EricB/Phi-3.5-mini-instruct-UQFF",
        from_uqff="phi3.5-mini-instruct-q4k.uqff",
        topology="device_map.yml",
    ),
)
```
Note: The isq field in topology entries is ignored when loading UQFF models, since quantization is pre-applied.
To create a UQFF model, you generate the UQFF file(s) with the quantize command shown below. The output path (-o) can be either a .uqff file path or a directory where files will be auto-named. Along with the UQFF file, the generation process will also output several .json configuration files and residual.safetensors. All of these files together constitute the UQFF model, and should be kept together when uploading.
Note: Only the .uqff files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the others to be overwritten.
Single quantization (file output):
```bash
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k -o phi3.5-uqff/phi3.5-mini-instruct-q4k.uqff
```
Single quantization (directory output):
```bash
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k -o phi3.5-uqff/
```
Multiple quantizations at once (directory output):
Generate multiple UQFF files by specifying multiple --isq types. All quantizations go to the same output directory.
```bash
# Comma-separated ISQ types
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k,q8_0 -o phi3.5-uqff/

# Equivalent: repeated --isq flags
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k --isq q8_0 -o phi3.5-uqff/
```
This produces the following in phi3.5-uqff/:
- q4k-0.uqff (and additional shards q4k-1.uqff, ... if the model is large)
- q8_0-0.uqff (and additional shards if needed)
- README.md (auto-generated model card for Hugging Face)
- config.json, tokenizer.json, residual.safetensors, etc.

Note: Multiple --isq values require a directory output path (not a .uqff file path).
When using directory output mode, the quantize command automatically generates a README.md model card in the output directory. This model card includes Hugging Face YAML frontmatter, a description, and an examples table with the appropriate --from-uqff commands for each quantization.
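The exact contents depend on the model and the quantizations generated, but the card is roughly of the following shape (an illustrative sketch, not verbatim tool output):

```
---
base_model: microsoft/Phi-3.5-mini-instruct
tags:
  - uqff
  - mistral.rs
---

# Phi-3.5-mini-instruct-UQFF

UQFF quantizations of microsoft/Phi-3.5-mini-instruct for use with mistral.rs.

| Quantization | Example command |
| -- | -- |
| q4k | mistralrs run -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-q4k-0.uqff |
```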
By default, the command prompts interactively for the base model and HF repo ID. To bypass the interactive prompts (e.g. in CI or scripts), use --uqff-base-model and/or --uqff-repo-id:
```bash
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k -o phi3.5-uqff/ \
  --uqff-base-model microsoft/Phi-3.5-mini-instruct \
  --uqff-repo-id EricB/Phi-3.5-mini-instruct-UQFF
```
To skip model card generation entirely, use --no-readme:
```bash
mistralrs quantize -m microsoft/Phi-3.5-mini-instruct --isq q4k -o phi3.5-uqff/ --no-readme
```
After quantization completes in directory mode, the quantize command prints the hf CLI upload command you can use. The general form is:
```bash
hf upload <YOUR_USERNAME>/<MODEL_NAME>-UQFF <output_dir> --repo-type model --private
```
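For example, continuing the Phi-3.5 example above (the repository name and output directory are the ones used earlier in this document):

```bash
hf upload EricB/Phi-3.5-mini-instruct-UQFF phi3.5-uqff/ --repo-type model --private
```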
Alternatively, you can upload with Git LFS:
1. git lfs install
2. hf lfs-enable-largefiles . (you will need to pip install huggingface_hub)

After this, you can use Git to track, commit, and push files, as shown in the sketch below.
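A typical sequence, assuming the Hugging Face repository has already been created and cloned and you are inside the clone (the commit message is just an example):

```bash
git lfs install
# requires: pip install huggingface_hub
hf lfs-enable-largefiles .
git add .
git commit -m "Add UQFF model files"
git push
```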
You can find a list of models in the Hugging Face model collection.
Have you created a UQFF model on Hugging Face? If so, please create an issue.