docs/src/content/docs/reference/cli/run.md
Run model in interactive mode, or one-shot mode with -i
mistralrs run [OPTIONS] [COMMAND]
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | HuggingFace model ID or local path to model directory | |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--format <FORMAT> | Model format: plain (safetensors), gguf, or ggml Auto-detected if not specified Possible values: plain, gguf, ggml. | |
-f, --quantized-file <QUANTIZED_FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) | |
--tok-model-id <TOK_MODEL_ID> | Model ID for tokenizer when using quantized format | |
--gqa <GQA> | 1 | GQA value for GGML models |
--lora <LORA> | LoRA adapter model ID(s), semicolon-separated for multiple | |
--xlora <XLORA> | X-LoRA adapter model ID | |
--xlora-order <XLORA_ORDER> | X-LoRA ordering JSON file | |
--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX> | Target non-granular index for X-LoRA | |
--quant <QUANT> | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob | |
--isq <IN_SITU_QUANT> | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1", etc.) | |
--from-uqff <FROM_UQFF> | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations | |
--isq-organization <ISQ_ORGANIZATION> | ISQ organization strategy: default or moqe | |
--imatrix <IMATRIX> | imatrix file for enhanced quantization | |
--calibration-file <CALIBRATION_FILE> | Calibration file for imatrix generation | |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
--paged-attn <MODE> | auto | PagedAttention mode - auto: enabled on CUDA, disabled on Metal/CPU (default) - on: force enable (fails if unsupported) - off: force disable Possible values: auto, on, off. |
--pa-context-len <CONTEXT_LEN> | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM | |
--pa-memory-mb <MEMORY_MB> | GPU memory to allocate in MBs (alternative to context-len) | |
--pa-memory-fraction <MEMORY_FRACTION> | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) | |
--pa-block-size <BLOCK_SIZE> | Tokens per block (default: 32 on CUDA) | |
--pa-cache-type <CACHE_TYPE> | auto | KV cache quantization type |
--max-edge <MAX_EDGE> | Maximum edge length for image resizing (aspect ratio preserved) | |
--max-num-images <MAX_NUM_IMAGES> | Maximum number of images per request | |
--max-image-length <MAX_IMAGE_LENGTH> | Maximum image dimension for device mapping | |
--max-seqs <MAX_SEQS> | 32 | Maximum concurrent sequences |
--no-kv-cache | false | Disable KV cache entirely |
--prefix-cache-n <PREFIX_CACHE_N> | 16 | Number of prefix caches to hold (0 to disable) |
-c, --chat-template <CHAT_TEMPLATE> | Custom chat template file (.json or .jinja) | |
-j, --jinja-explicit <JINJA_EXPLICIT> | Explicit JINJA template override | |
--matformer-config-path <MATFORMER_CONFIG_PATH> | Path to a MatFormer config (CSV/JSON describing available slices). See model card | |
--matformer-slice-name <MATFORMER_SLICE_NAME> | MatFormer slice to load (must match a slice name in the config file) | |
--mtp-model <MTP_MODEL> | MTP assistant model id or path | |
--mtp-n-predict <MTP_N_PREDICT> | Number of MTP draft tokens to propose per target step | |
--mcp-config <MCP_CONFIG> | Path to an MCP client configuration JSON. Also reads MCP_CONFIG_PATH if unset | |
--agent | false | Build a local agent: enables web search, Python code execution, and shell execution, runs the agentic tool loop with a per-session temp workdir. Equivalent to passing --enable-search --enable-code-execution --enable-shell together |
--enable-search | false | Enable web search (requires embedding model) |
--search-embedding-model <SEARCH_EMBEDDING_MODEL> | Search embedding model to use. Requires --enable-search or --agent Possible values: embedding-gemma. | |
--enable-code-execution | false | Enable Python code execution tool (WARNING: allows arbitrary code execution) |
--enable-shell | false | Enable shell execution tool (WARNING: allows arbitrary command execution) |
--code-exec-python <CODE_EXEC_PYTHON> | Python interpreter path for code execution. Requires code execution to be on (via --enable-code-execution or --agent). Defaults to python3 | |
--code-exec-timeout <CODE_EXEC_TIMEOUT> | Code execution timeout in seconds (default: 30). Requires code execution to be on | |
--code-exec-workdir <CODE_EXEC_WORKDIR> | Working directory for code execution. Defaults to a temp dir; use "." for cwd. Requires code execution to be on | |
--shell-path <SHELL_PATH> | Shell executable path. Requires shell execution to be on. Defaults to /bin/sh | |
--shell-timeout <SHELL_TIMEOUT> | Shell execution timeout in seconds (default: 30). Requires shell execution to be on | |
--shell-workdir <SHELL_WORKDIR> | Root directory for per-session shell working directories. Defaults to temp dirs | |
--skills-dir <SKILLS_DIR> | Directory for uploaded OpenAI-compatible Skills. Defaults to the system temp directory | |
--agent-permission <PERMISSION> | auto | Agent action permission mode Possible values: auto, ask, deny. |
--sandbox <MODE> | auto | Sandbox mode Possible values: auto, on, off. |
--sandbox-profile <PROFILE> | Sandbox policy profile Possible values: restricted, developer. | |
--sb-max-memory-mb <MEMORY_MB> | Per-session memory cap in MiB (default: 2048) | |
--sb-max-cpu-secs <CPU_SECS> | Per-session CPU time cap in seconds (default: 300) | |
--sb-max-procs <PROCS> | Per-session process/thread cap (default: 64) | |
--sandbox-network <NETWORK> | Network access permitted to the sandboxed session Possible values: none, loopback, full. | |
--thinking <THINKING> | Control thinking mode for models that support it. Use --thinking or --thinking true to force on, --thinking false to force off. Omit to defer to the chat template default Possible values: true, false. | |
-i, --input <INPUT> | One-shot text prompt. When provided, sends a single request and exits instead of entering interactive mode. Combine with --image, --video, or --audio for multimodal requests | |
--image <IMAGE> | Image URL(s) or file path(s) to include in the request (requires -i). Can be specified multiple times: --image img1.jpg --image img2.png | |
--video <VIDEO> | Video URL(s) or file path(s) to include in the request (requires -i). Can be specified multiple times: --video vid1.mp4 --video vid2.webm | |
--audio <AUDIO> | Audio URL(s) or file path(s) to include in the request (requires -i). Can be specified multiple times: --audio audio1.wav --audio audio2.mp3 |
Auto-detect model type (recommended)
mistralrs run auto [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--format <FORMAT> | Model format: plain (safetensors), gguf, or ggml Auto-detected if not specified Possible values: plain, gguf, ggml. | |
-f, --quantized-file <QUANTIZED_FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) | |
--tok-model-id <TOK_MODEL_ID> | Model ID for tokenizer when using quantized format | |
--gqa <GQA> | 1 | GQA value for GGML models |
--lora <LORA> | LoRA adapter model ID(s), semicolon-separated for multiple | |
--xlora <XLORA> | X-LoRA adapter model ID | |
--xlora-order <XLORA_ORDER> | X-LoRA ordering JSON file | |
--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX> | Target non-granular index for X-LoRA | |
--quant <QUANT> | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob | |
--isq <IN_SITU_QUANT> | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1", etc.) | |
--from-uqff <FROM_UQFF> | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations | |
--isq-organization <ISQ_ORGANIZATION> | ISQ organization strategy: default or moqe | |
--imatrix <IMATRIX> | imatrix file for enhanced quantization | |
--calibration-file <CALIBRATION_FILE> | Calibration file for imatrix generation | |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
--paged-attn <MODE> | auto | PagedAttention mode - auto: enabled on CUDA, disabled on Metal/CPU (default) - on: force enable (fails if unsupported) - off: force disable Possible values: auto, on, off. |
--pa-context-len <CONTEXT_LEN> | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM | |
--pa-memory-mb <MEMORY_MB> | GPU memory to allocate in MBs (alternative to context-len) | |
--pa-memory-fraction <MEMORY_FRACTION> | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) | |
--pa-block-size <BLOCK_SIZE> | Tokens per block (default: 32 on CUDA) | |
--pa-cache-type <CACHE_TYPE> | auto | KV cache quantization type |
--max-edge <MAX_EDGE> | Maximum edge length for image resizing (aspect ratio preserved) | |
--max-num-images <MAX_NUM_IMAGES> | Maximum number of images per request | |
--max-image-length <MAX_IMAGE_LENGTH> | Maximum image dimension for device mapping |
Text generation model with explicit configuration
mistralrs run text [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--format <FORMAT> | Model format: plain (safetensors), gguf, or ggml Auto-detected if not specified Possible values: plain, gguf, ggml. | |
-f, --quantized-file <QUANTIZED_FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) | |
--tok-model-id <TOK_MODEL_ID> | Model ID for tokenizer when using quantized format | |
--gqa <GQA> | 1 | GQA value for GGML models |
--lora <LORA> | LoRA adapter model ID(s), semicolon-separated for multiple | |
--xlora <XLORA> | X-LoRA adapter model ID | |
--xlora-order <XLORA_ORDER> | X-LoRA ordering JSON file | |
--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX> | Target non-granular index for X-LoRA | |
--quant <QUANT> | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob | |
--isq <IN_SITU_QUANT> | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1", etc.) | |
--from-uqff <FROM_UQFF> | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations | |
--isq-organization <ISQ_ORGANIZATION> | ISQ organization strategy: default or moqe | |
--imatrix <IMATRIX> | imatrix file for enhanced quantization | |
--calibration-file <CALIBRATION_FILE> | Calibration file for imatrix generation | |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
--paged-attn <MODE> | auto | PagedAttention mode - auto: enabled on CUDA, disabled on Metal/CPU (default) - on: force enable (fails if unsupported) - off: force disable Possible values: auto, on, off. |
--pa-context-len <CONTEXT_LEN> | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM | |
--pa-memory-mb <MEMORY_MB> | GPU memory to allocate in MBs (alternative to context-len) | |
--pa-memory-fraction <MEMORY_FRACTION> | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) | |
--pa-block-size <BLOCK_SIZE> | Tokens per block (default: 32 on CUDA) | |
--pa-cache-type <CACHE_TYPE> | auto | KV cache quantization type |
Multimodal model
mistralrs run multimodal [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--format <FORMAT> | Model format: plain (safetensors), gguf, or ggml Auto-detected if not specified Possible values: plain, gguf, ggml. | |
-f, --quantized-file <QUANTIZED_FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) | |
--tok-model-id <TOK_MODEL_ID> | Model ID for tokenizer when using quantized format | |
--gqa <GQA> | 1 | GQA value for GGML models |
--lora <LORA> | LoRA adapter model ID(s), semicolon-separated for multiple | |
--xlora <XLORA> | X-LoRA adapter model ID | |
--xlora-order <XLORA_ORDER> | X-LoRA ordering JSON file | |
--tgt-non-granular-index <TGT_NON_GRANULAR_INDEX> | Target non-granular index for X-LoRA | |
--quant <QUANT> | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob | |
--isq <IN_SITU_QUANT> | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1", etc.) | |
--from-uqff <FROM_UQFF> | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations | |
--isq-organization <ISQ_ORGANIZATION> | ISQ organization strategy: default or moqe | |
--imatrix <IMATRIX> | imatrix file for enhanced quantization | |
--calibration-file <CALIBRATION_FILE> | Calibration file for imatrix generation | |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
--paged-attn <MODE> | auto | PagedAttention mode - auto: enabled on CUDA, disabled on Metal/CPU (default) - on: force enable (fails if unsupported) - off: force disable Possible values: auto, on, off. |
--pa-context-len <CONTEXT_LEN> | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM | |
--pa-memory-mb <MEMORY_MB> | GPU memory to allocate in MBs (alternative to context-len) | |
--pa-memory-fraction <MEMORY_FRACTION> | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) | |
--pa-block-size <BLOCK_SIZE> | Tokens per block (default: 32 on CUDA) | |
--pa-cache-type <CACHE_TYPE> | auto | KV cache quantization type |
--max-edge <MAX_EDGE> | Maximum edge length for image resizing (aspect ratio preserved) | |
--max-num-images <MAX_NUM_IMAGES> | Maximum number of images per request | |
--max-image-length <MAX_IMAGE_LENGTH> | Maximum image dimension for device mapping |
Image generation model (diffusion)
mistralrs run diffusion [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
Speech synthesis model
mistralrs run speech [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
Embedding model
mistralrs run embedding [OPTIONS] --model-id <MODEL_ID>
| Option | Default | Description |
|---|---|---|
-m, --model-id <MODEL_ID> | required | HuggingFace model ID or local path to model directory |
-t, --tokenizer <TOKENIZER> | Path to local tokenizer.json file | |
-a, --arch <ARCH> | Model architecture (auto-detected if not specified) | |
--dtype <DTYPE> | auto | Model data type |
--format <FORMAT> | Model format: plain (safetensors), gguf, or ggml Auto-detected if not specified Possible values: plain, gguf, ggml. | |
-f, --quantized-file <QUANTIZED_FILE> | Quantized model filename(s) for GGUF/GGML (semicolon-separated for multiple) | |
--tok-model-id <TOK_MODEL_ID> | Model ID for tokenizer when using quantized format | |
--gqa <GQA> | 1 | GQA value for GGML models |
--quant <QUANT> | Quantization front-door. Numeric levels (2, 3, 4, 5, 6, 8) and ISQ names prefer a prebuilt UQFF from mistralrs-community/<model>-UQFF, then fall back to ISQ. auto is for serve, run, and bench; tune rejects it because tune is the recommender. Use --isq for the explicit knob | |
--isq <IN_SITU_QUANT> | In-situ quantization level (e.g., "4", "8", "q4_0", "q4_1", etc.) | |
--from-uqff <FROM_UQFF> | UQFF file(s) to load from. Accepts numeric shorthands (2, 3, 4, 5, 6, 8) to auto-detect the appropriate UQFF file (e.g., --from-uqff 8 finds q8_0-0.uqff or afq8-0.uqff). Also accepts ISQ type names (e.g., q4k, afq8). Shards are auto-discovered: specifying the first shard (e.g., q4k-0.uqff) automatically finds q4k-1.uqff, etc. Use semicolons to separate different quantizations | |
--isq-organization <ISQ_ORGANIZATION> | ISQ organization strategy: default or moqe | |
--imatrix <IMATRIX> | imatrix file for enhanced quantization | |
--calibration-file <CALIBRATION_FILE> | Calibration file for imatrix generation | |
--cpu | false | Force CPU-only execution |
-n, --device-layers <DEVICE_LAYERS> | Device layer mapping (format: ORD:NUM;... e.g., "0:10;1:20") Omit for automatic device mapping | |
--topology <TOPOLOGY> | Topology YAML file for device mapping | |
--hf-cache <HF_CACHE> | Custom HuggingFace cache directory | |
--max-seq-len <MAX_SEQ_LEN> | 4096 | Max sequence length for automatic device mapping |
--max-batch-size <MAX_BATCH_SIZE> | 1 | Max batch size for automatic device mapping |
--paged-attn <MODE> | auto | PagedAttention mode - auto: enabled on CUDA, disabled on Metal/CPU (default) - on: force enable (fails if unsupported) - off: force disable Possible values: auto, on, off. |
--pa-context-len <CONTEXT_LEN> | Allocate KV cache for this context length. If not specified, defaults to using 90% of available VRAM | |
--pa-memory-mb <MEMORY_MB> | GPU memory to allocate in MBs (alternative to context-len) | |
--pa-memory-fraction <MEMORY_FRACTION> | GPU memory utilization fraction 0.0-1.0 (alternative to context-len/memory-mb) | |
--pa-block-size <BLOCK_SIZE> | Tokens per block (default: 32 on CUDA) | |
--pa-cache-type <CACHE_TYPE> | auto | KV cache quantization type |