mistralrs quantize

<!-- Generated from clap definitions by mistralrs-cli docgen. Do not edit. -->

Generate a UQFF quantized model file.

mistralrs quantize [OPTIONS] [COMMAND]

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | | HuggingFace model ID or local path to model directory |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--isq <IN_SITU_QUANT>` | | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated --isq flags (e.g., "--isq q4k,q8_0" or "--isq q4k --isq q8_0") |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type) |
| `--no-readme` | `false` | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
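
The `--device-layers` value uses the `ORD:NUM;...` format, where each pair gives a device ordinal and a layer count. As a minimal sketch, the snippet below assembles that string from a list of per-device layer counts. The counts (10 and 20) are illustrative, taken from the `"0:10;1:20"` example in the table, and the variable names are hypothetical.

```shell
# Hypothetical layer counts: device ordinal 0 gets 10 layers, ordinal 1 gets 20.
layers="10 20"

device_layers=""
ord=0
for num in $layers; do
  # Append one ORD:NUM pair, separated from the previous pair by ';'.
  device_layers="${device_layers:+$device_layers;}$ord:$num"
  ord=$((ord + 1))
done

# Quote the value on the command line so the shell does not treat ';' as a separator.
echo "--device-layers \"$device_layers\""
```

The quoting matters: an unquoted `;` would end the command at the first pair.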

mistralrs quantize auto

Auto-detect model type (recommended)

mistralrs quantize auto [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated --isq flags (e.g., "--isq q4k,q8_0" or "--isq q4k --isq q8_0") |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: -o model/model-q4k.uqff or -o output/ |
| `--no-readme` | `false` | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
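
A possible `auto` invocation that produces two quantizations in one run is sketched below. It passes both ISQ levels as a single comma-separated value and points `--output` at a directory, so files are auto-named per ISQ type. The model ID and output directory are hypothetical placeholders; the script only prints the command so it can be reviewed first.

```shell
# Hypothetical placeholders; substitute your own model and output directory.
MODEL_ID="example-org/example-model"
OUT_DIR="output/"

# Three required options: --model-id, --isq, --output.
# "q4k,q8_0" is the comma-separated form; repeating --isq is equivalent.
set -- quantize auto --model-id "$MODEL_ID" --isq q4k,q8_0 --output "$OUT_DIR"

cmd="mistralrs $*"
echo "$cmd"

# To execute instead of printing, run: mistralrs "$@"
```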

mistralrs quantize text

Text generation model with explicit architecture

mistralrs quantize text [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `-a, --arch <ARCH>` | | Model architecture (required for text models) |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated --isq flags (e.g., "--isq q4k,q8_0" or "--isq q4k --isq q8_0") |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: -o model/model-q4k.uqff or -o output/ |
| `--no-readme` | `false` | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
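
The sketch below shows one way a `text` invocation might look when writing a single ISQ level to an explicit `.uqff` file. The model ID is a hypothetical placeholder, and the `--arch` value `mistral` is an assumption: this page does not list the accepted architecture names, so check `mistralrs quantize text --help` for the real set. The output path is the example given in the table.

```shell
# Hypothetical placeholder model ID.
MODEL_ID="example-org/example-text-model"
# ASSUMPTION: "mistral" is used for illustration only; accepted --arch values are not listed here.
ARCH="mistral"
# A single ISQ level can be written to one explicit .uqff file.
OUTPUT="model/model-q4k.uqff"

set -- quantize text --model-id "$MODEL_ID" --arch "$ARCH" --isq q4k --output "$OUTPUT"

cmd="mistralrs $*"
echo "$cmd"

# To execute instead of printing, run: mistralrs "$@"
```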

mistralrs quantize multimodal

Multimodal model

mistralrs quantize multimodal [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated --isq flags (e.g., "--isq q4k,q8_0" or "--isq q4k --isq q8_0") |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: -o model/model-q4k.uqff or -o output/ |
| `--no-readme` | `false` | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
| `--max-edge <MAX_EDGE>` | | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <MAX_NUM_IMAGES>` | | Maximum number of images per request |
| `--max-image-length <MAX_IMAGE_LENGTH>` | | Maximum image dimension for device mapping |
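
For unattended runs, the two `--uqff-*` options are described as skipping the interactive README prompts. The sketch below combines them with the image-related limits. Every ID and numeric value is a hypothetical placeholder chosen for illustration, not a recommended setting.

```shell
# Hypothetical placeholders; none of these values come from the reference.
MODEL_ID="example-org/example-vision-model"
REPO_ID="example-user/example-vision-model-UQFF"

set -- quantize multimodal \
  --model-id "$MODEL_ID" \
  --isq q4k \
  --output output/ \
  --uqff-base-model "$MODEL_ID" \
  --uqff-repo-id "$REPO_ID" \
  --max-edge 1024 \
  --max-num-images 4

cmd="mistralrs $*"
echo "$cmd"

# To execute instead of printing, run: mistralrs "$@"
```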

mistralrs quantize embedding

Embedding model

mistralrs quantize embedding [OPTIONS] --model-id <MODEL_ID> --isq <IN_SITU_QUANT> --output <OUTPUT_PATH>

| Option | Default | Description |
| --- | --- | --- |
| `-m, --model-id <MODEL_ID>` | required | Model ID to load (HuggingFace repo or local path) |
| `-t, --tokenizer <TOKENIZER>` | | Path to local tokenizer.json file |
| `--dtype <DTYPE>` | `auto` | Model data type |
| `--isq <IN_SITU_QUANT>` | required | In-situ quantization level(s). Multiple values can be comma-separated or specified via repeated --isq flags (e.g., "--isq q4k,q8_0" or "--isq q4k --isq q8_0") |
| `--isq-organization <ISQ_ORGANIZATION>` | | ISQ organization strategy: default or moqe |
| `--imatrix <IMATRIX>` | | imatrix file for enhanced quantization |
| `--calibration-file <CALIBRATION_FILE>` | | Calibration file for imatrix generation |
| `--cpu` | `false` | Force CPU-only execution |
| `-n, --device-layers <DEVICE_LAYERS>` | | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") |
| `--topology <TOPOLOGY>` | | Topology YAML file for device mapping |
| `--hf-cache <HF_CACHE>` | | Custom HuggingFace cache directory |
| `--max-seq-len <MAX_SEQ_LEN>` | `4096` | Max sequence length for automatic device mapping |
| `--max-batch-size <MAX_BATCH_SIZE>` | `1` | Max batch size for automatic device mapping |
| `-o, --output <OUTPUT_PATH>` | required | Output path: a .uqff file path (single ISQ) or a directory (auto-names files per ISQ type). Examples: -o model/model-q4k.uqff or -o output/ |
| `--no-readme` | `false` | Skip README.md model card generation (generated by default in directory mode) |
| `--uqff-base-model <UQFF_BASE_MODEL>` | | Base model ID for the generated README (skips interactive prompt) |
| `--uqff-repo-id <UQFF_REPO_ID>` | | HF repo ID for the generated README and upload hint (skips interactive prompt) |
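
As a final sketch, an `embedding` run on a machine without a GPU might use the repeated `--isq` form, force CPU execution, and skip the model card that directory mode would otherwise generate. The model ID is a hypothetical placeholder.

```shell
# Hypothetical placeholder model ID.
MODEL_ID="example-org/example-embedding-model"

# Repeated --isq flags are the alternative to the comma-separated form.
# --cpu forces CPU-only execution; --no-readme skips README.md generation.
set -- quantize embedding \
  --model-id "$MODEL_ID" \
  --isq q4k --isq q8_0 \
  --cpu \
  --output output/ \
  --no-readme

cmd="mistralrs $*"
echo "$cmd"

# To execute instead of printing, run: mistralrs "$@"
```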