# SmolLM3 Unified Inference

A unified Rust implementation for running SmolLM3 models with the Candle ML framework. It supports both quantized (GGUF) and full-precision (safetensors) models from a single codebase.

## Features

- **Dual Model Support**: Run either quantized or full-precision models
- **Multiple Quantization Levels**: Q4_K_M (1.9GB), Q8_0 (3.3GB), F16 (6.2GB)
- **Chat Template Support**: Automatic formatting for instruction-tuned models
- **Thinking Mode**: Enable reasoning traces with `/think` mode
- **NoPE Architecture**: Supports SmolLM3's mixed RoPE/NoPE layer configuration
- **Auto-download**: Automatically fetches models from the HuggingFace Hub

## Quick Start

### Quantized Model (Default)

```bash
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"
```

### Full Precision Model

```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Write a sorting algorithm in Rust"
```

## Command Line Options

### Model Selection

- `--model-type <TYPE>`: Choose `quantized` or `full` (default: `quantized`)
- `--model <VARIANT>`: Choose `3b` (instruct) or `3b-base` (default: `3b`)
- `--quantization <LEVEL>`: For quantized models: `q4_k_m`, `q8_0`, or `f16` (default: `q8_0`)
- `--dtype <TYPE>`: For full-precision models: `f32`, `f16`, `bf16`, or `auto` (default: `auto`)

### Generation Parameters

- `--prompt <TEXT>`: The prompt to generate from
- `-n, --sample-len <NUM>`: Number of tokens to generate (default: 1000)
- `--temperature <FLOAT>`: Sampling temperature; 0 selects greedy decoding (default: 0.8)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--top-k <NUM>`: Only sample among the top K tokens
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1; see the sketch below)
- `--repeat-last-n <NUM>`: Context size for the repeat penalty (default: 64)
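
These flags map onto Candle's standard sampling utilities. Below is a minimal sketch of how they interact, assuming the `LogitsProcessor` and `apply_repeat_penalty` helpers from `candle-transformers` that Candle's examples commonly use; the exact wiring inside this example may differ.

```rust
use candle_core::{Device, Result, Tensor};
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::utils::apply_repeat_penalty;

fn main() -> Result<()> {
    // Stand-in logits over a 4-token vocabulary, plus previously generated tokens.
    let logits = Tensor::new(&[1.0f32, 2.0, 0.5, 0.1], &Device::Cpu)?;
    let tokens: Vec<u32> = vec![1, 1, 2];

    // --repeat-penalty / --repeat-last-n: down-weight recently generated tokens.
    let repeat_last_n = 64usize.min(tokens.len());
    let start_at = tokens.len() - repeat_last_n;
    let logits = apply_repeat_penalty(&logits, 1.1, &tokens[start_at..])?;

    // --temperature / --top-p: temperature scaling followed by nucleus sampling.
    let mut sampler = LogitsProcessor::new(299792458, Some(0.8), Some(0.95));
    let next_token = sampler.sample(&logits)?;
    println!("sampled token id: {next_token}");
    Ok(())
}
```

Passing `None` as the temperature yields greedy (argmax) decoding, which is what `--temperature 0` requests.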

### Advanced Options

- `--no-chat-template`: Disable chat template formatting (use for base models)
- `--thinking`: Enable thinking/reasoning mode via `/think` tags (see the template sketch after this list)
- `--split-prompt`: Process prompt tokens individually (for debugging)
- `--tracing`: Enable performance tracing (generates a trace JSON file)
- `--model-path <PATH>`: Use a local model file instead of auto-downloading
- `--tokenizer <PATH>`: Use a local tokenizer instead of auto-downloading
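
For reference, here is a rough sketch of what chat-template formatting produces, assuming a ChatML-style template where the `/think` and `/no_think` flags live in the system turn. The implementation derives the real template from the model's tokenizer configuration, so treat the token strings below as illustrative only.

```rust
// Illustrative only: the actual template ships with the model's tokenizer
// config; the <|im_start|>/<|im_end|> markers and flags are assumptions here.
fn format_prompt(user: &str, thinking: bool) -> String {
    let flag = if thinking { "/think" } else { "/no_think" };
    format!(
        "<|im_start|>system\n{flag}<|im_end|>\n\
         <|im_start|>user\n{user}<|im_end|>\n\
         <|im_start|>assistant\n"
    )
}

fn main() {
    println!("{}", format_prompt("Explain Rust's ownership system", true));
}
```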

## Quantization Comparison

| Level  | Size  | Quality | Use Case                                 |
|--------|-------|---------|------------------------------------------|
| Q4_K_M | 1.9GB | Good    | Fast inference, constrained environments |
| Q8_0   | 3.3GB | Better  | Balanced quality and speed               |
| F16    | 6.2GB | Best    | Maximum quality in GGUF format           |

## Examples

### Creative Writing with Thinking Mode

```bash
cargo run --release --example smollm3 -- \
  --thinking \
  --temperature 0.9 \
  --prompt "Write a short sci-fi story about AI"
```

### Code Generation (Base Model)

```bash
cargo run --release --example smollm3 -- \
  --model 3b-base \
  --no-chat-template \
  --temperature 0.2 \
  --prompt "def fibonacci(n):"
```

### High Quality Output

```bash
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --temperature 0.7 \
  --prompt "Explain quantum entanglement"
```

## Model Architecture

SmolLM3 uses a hybrid RoPE/NoPE architecture:

- **RoPE layers**: Standard rotary position embeddings (75% of layers)
- **NoPE layers**: No position embeddings (every 4th layer, i.e. 25% of layers)

This configuration is automatically detected and handled by the implementation.
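
As a concrete illustration of the interval rule above (assuming SmolLM3-3B's 36 decoder layers; the implementation reads the actual layout from the model config rather than hard-coding it):

```rust
// Sketch of the mixed layout: every 4th layer skips rotary embeddings.
// (Assumptions: 36 decoder layers and a NoPE interval of 4.)
fn uses_rope(layer_idx: usize, no_rope_interval: usize) -> bool {
    (layer_idx + 1) % no_rope_interval != 0
}

fn main() {
    let num_layers = 36;
    let rope = (0..num_layers).filter(|&i| uses_rope(i, 4)).count();
    // Prints "27/36 layers use RoPE", i.e. 75% RoPE / 25% NoPE.
    println!("{rope}/{num_layers} layers use RoPE");
}
```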

## Hardware Requirements

- Quantized Q4_K_M: ~2.5GB RAM
- Quantized Q8_0: ~4GB RAM
- Full F16: ~7GB RAM
- Full F32: ~13GB RAM

GPU acceleration is supported via CUDA (with the `cuda` feature) or Metal (on macOS).
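
These figures roughly track parameter count times bits per weight, with headroom for the KV cache and activations. A back-of-the-envelope sketch, assuming ~3.1B parameters and typical GGUF bit widths (Q4_K_M averages about 4.85 bits/weight; Q8_0 stores 8.5 bits/weight):

```rust
// Rough weight-memory arithmetic; the RAM figures above add runtime overhead
// (KV cache, activations) on top of these weight sizes.
fn main() {
    let params = 3.1e9_f64; // assumed parameter count
    for (name, bits) in [("Q4_K_M", 4.85), ("Q8_0", 8.5), ("F16", 16.0), ("F32", 32.0)] {
        let gb = params * bits / 8.0 / 1e9;
        println!("{name}: ~{gb:.1} GB of weights");
    }
}
```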

## Troubleshooting

**Model download fails**: Check your internet connection and HuggingFace Hub access.

**Out of memory**: Try a smaller quantization level, or reduce the generation length with `--sample-len`.

**Compilation errors**: Ensure you are using the latest version of the Candle crate.

## License

This implementation follows the Candle framework license. SmolLM3 models are available under the Apache 2.0 license.