# SmolLM3
A unified Rust implementation for running SmolLM3 models using the Candle ML framework. Supports both quantized (GGUF) and full precision (safetensors) models with a single codebase.
## Running the example

### Quantized model (GGUF, default)

```bash
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"
```
### Full precision model (safetensors)

```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Write a sorting algorithm in Rust"
```
## Command-line options

- `--model-type <TYPE>`: Choose `quantized` or `full` (default: `quantized`)
- `--model <VARIANT>`: Choose `3b` (instruct) or `3b-base` (default: `3b`)
- `--quantization <LEVEL>`: For quantized models - `q4_k_m`, `q8_0`, or `f16` (default: `q8_0`)
- `--dtype <TYPE>`: For full models - `f32`, `f16`, `bf16`, or `auto` (default: `auto`)
- `--prompt <TEXT>`: The prompt to generate from
- `-n, --sample-len <NUM>`: Number of tokens to generate (default: 1000)
- `--temperature <FLOAT>`: Sampling temperature, 0 for greedy (default: 0.8)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--top-k <NUM>`: Only sample among the top K tokens
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <NUM>`: Context size for the repeat penalty (default: 64)
- `--no-chat-template`: Disable chat template formatting (use for base models)
- `--thinking`: Enable thinking/reasoning mode with `/think` tags
- `--split-prompt`: Process prompt tokens individually (for debugging)
- `--tracing`: Enable performance tracing (generates a trace JSON file)
- `--model-path <PATH>`: Use a local model file instead of auto-downloading
- `--tokenizer <PATH>`: Use a local tokenizer instead of auto-downloading

## Quantization levels

| Level | Size | Quality | Use Case |
|---|---|---|---|
| Q4_K_M | 1.9GB | Good | Fast inference, constrained environments |
| Q8_0 | 3.3GB | Better | Balanced quality and speed |
| F16 | 6.2GB | Best | Maximum quality in GGUF format |
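
For example, to run with the smallest quantization from the table above (the prompt is illustrative):

```bash
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q4_k_m \
  --prompt "Explain Rust's ownership system"
```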
## More examples

### Thinking mode

Enable extended reasoning with `/think` mode:

```bash
cargo run --release --example smollm3 -- \
  --thinking \
  --temperature 0.9 \
  --prompt "Write a short sci-fi story about AI"
```
### Base model completion

```bash
cargo run --release --example smollm3 -- \
  --model 3b-base \
  --no-chat-template \
  --temperature 0.2 \
  --prompt "def fibonacci(n):"
```
### Full precision generation

```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --temperature 0.7 \
  --prompt "Explain quantum entanglement"
```
## Architecture notes

SmolLM3 uses a hybrid RoPE/NoPE architecture: most layers apply rotary position embeddings (RoPE), while the remaining layers omit positional embeddings entirely (NoPE) to aid long-context generalization. This configuration is automatically detected and handled by the implementation.
## Hardware acceleration

GPU acceleration is supported via CUDA (build with the `cuda` feature) or Metal on macOS.
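A sketch of the corresponding build invocations; the `cuda` feature is mentioned above, while `metal` is assumed to be the conventional Candle feature flag for Metal:

```bash
# NVIDIA GPUs (CUDA):
cargo run --release --features cuda --example smollm3 -- \
  --prompt "Explain Rust's ownership system"

# Apple Silicon (Metal):
cargo run --release --features metal --example smollm3 -- \
  --prompt "Explain Rust's ownership system"
```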
## Troubleshooting

- **Model download fails**: Check your internet connection and HuggingFace Hub access.
- **Out of memory**: Try a smaller quantization level, or use `--sample-len` to reduce the generation length.
- **Compilation errors**: Ensure you're using the latest version of the Candle crate.
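If downloads are blocked, the `--model-path` and `--tokenizer` options listed above allow running fully offline from local files (the paths below are placeholders):

```bash
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --model-path /path/to/smollm3.q8_0.gguf \
  --tokenizer /path/to/tokenizer.json \
  --prompt "Explain Rust's ownership system"
```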
## License

This implementation follows the Candle framework license. SmolLM3 models are available under Apache 2.0.