SmolLM Model Family

This directory contains implementations for the SmolLM family of models developed by HuggingFace.

Models

SmolLM2 (see models/llama)

SmolLM2 models (135M, 360M, 1.7B) use the standard Llama architecture and are implemented in models/llama.rs; no separate implementation is needed. A hedged loading sketch follows the variant list below.

Variants:

  • HuggingFaceTB/SmolLM2-135M
  • HuggingFaceTB/SmolLM2-360M
  • HuggingFaceTB/SmolLM2-1.7B
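
As illustration, here is a minimal sketch of loading a SmolLM2 checkpoint through the shared llama module, modeled on candle's llama example. The exact Config/Cache/load signatures vary across candle versions, so treat the calls below as assumptions to check against models/llama.rs.

rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::llama::{Cache, Llama, LlamaConfig};

fn load_smollm2(weights: &str, config_json: &str) -> anyhow::Result<(Llama, Cache)> {
    let device = Device::Cpu;
    // SmolLM2 ships a standard Llama config.json, so the llama loader applies as-is.
    let cfg: LlamaConfig = serde_json::from_slice(&std::fs::read(config_json)?)?;
    let cfg = cfg.into_config(false); // false: no flash attention
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[weights], DType::F32, &device)? };
    let cache = Cache::new(true, DType::F32, &cfg, &device)?;
    let model = Llama::load(vb, &cfg)?;
    Ok((model, cache))
}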

SmolLM3

SmolLM3-3B uses NoPE (No Positional Encoding) layers, which require a custom implementation in smollm3.rs.

Key innovations:

  • Hybrid RoPE/NoPE (3:1 ratio; every 4th layer uses NoPE)
  • GQA with 4 groups (32 attention heads, 8 KV heads)
  • Very high rope_theta (5M vs typical 10k-500k)
  • Long context support (64k-128k tokens)
  • Thinking mode support with <think> tags
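
For concreteness, these hyperparameters map to config values along the following lines. This is a hedged sketch with illustrative field names, not the actual Config struct in smollm3.rs.

rust
// Hypothetical config values for SmolLM3-3B, assembled from the list above;
// field names are illustrative, not copied from smollm3.rs.
struct SmolLm3Config {
    num_attention_heads: usize,            // 32 query heads
    num_key_value_heads: usize,            // 8 KV heads (each shared by 4 query heads)
    rope_theta: f64,                       // 5e6, far above the typical 10k-500k
    no_rope_layer_interval: Option<usize>, // Some(4): every 4th layer skips RoPE
}

fn smollm3_3b() -> SmolLm3Config {
    SmolLm3Config {
        num_attention_heads: 32,
        num_key_value_heads: 8,
        rope_theta: 5_000_000.0,
        no_rope_layer_interval: Some(4),
    }
}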

Implementations:

  • smollm3.rs - Full precision model (safetensors)
  • quantized_smollm3.rs - Quantized GGUF model with weight reconstruction

Available Models:

  • HuggingFaceTB/SmolLM3-3B (Instruct-tuned)
  • HuggingFaceTB/SmolLM3-3B-Base (Base model)
  • unsloth/SmolLM3-3B-GGUF (Quantized: Q4_K_M, Q8_0, F16)

SmolVLM (planned)

Vision-language model variant, to be implemented.

Implementation Details

NoPE Architecture

SmolLM3 uses a mixed approach to positional encoding:

rust
pub fn should_skip_rope(&self, layer_idx: usize) -> bool {
    // Method 1: Explicit array from config
    if let Some(ref no_rope_layers) = self.no_rope_layers {
        if layer_idx < no_rope_layers.len() {
            return no_rope_layers[layer_idx] == 0;
        }
    }
    
    // Method 2: Interval pattern (SmolLM3-3B default)
    // Every 4th layer (indices 3, 7, 11, ...) skips RoPE
    if let Some(interval) = self.no_rope_layer_interval {
        return (layer_idx + 1) % interval == 0;
    }
    
    false // Default: use RoPE
}
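
As a quick sanity check on the interval rule, the following standalone snippet enumerates the NoPE layer indices; a 36-layer stack is assumed here for illustration.

rust
// List which layers skip RoPE under the interval rule above.
fn nope_layers(num_layers: usize, interval: usize) -> Vec<usize> {
    (0..num_layers).filter(|i| (i + 1) % interval == 0).collect()
}

fn main() {
    // Prints [3, 7, 11, 15, 19, 23, 27, 31, 35]: a 3:1 RoPE/NoPE ratio.
    println!("{:?}", nope_layers(36, 4));
}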

Quantized Weight Reconstruction

The quantized implementation includes special handling for Q/K weight reconstruction to maintain compatibility with the GGUF format's interleaved weight storage.
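
The exact transform depends on how the GGUF was exported. As a sketch, the reconstruction can be expressed as the head-wise permutation used by llama.cpp-style converters; the function below illustrates the general technique and is an assumption, not the code in quantized_smollm3.rs.

rust
use candle_core::{Result, Tensor};

// Hypothetical sketch of the head-wise Q/K permutation used by
// llama.cpp-style converters: within each head, rows are regrouped between
// the interleaved layout and the rotate-half layout expected by RoPE code.
fn permute_qk(w: &Tensor, n_head: usize) -> Result<Tensor> {
    let (rows, cols) = w.dims2()?;
    let head_dim = rows / n_head;
    w.reshape((n_head, 2, head_dim / 2, cols))?
        .transpose(1, 2)? // swap the pair axis with the half-dim axis
        .contiguous()?
        .reshape((rows, cols))
}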

Thinking Mode

SmolLM3 supports explicit reasoning with thinking tags:

  • Enabled: <|im_start|>assistant\n<think>\n (model generates reasoning)
  • Disabled: <|im_start|>assistant\n<think>\n\n</think>\n (skip to answer)
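
A minimal sketch of toggling the mode when building a prompt, based on the tag layout above; this helper is illustrative and not part of the example code.

rust
// Hypothetical prompt builder for SmolLM3's thinking mode. Real chat
// templating lives in the model's tokenizer/chat-template configuration.
fn assistant_prefix(thinking: bool) -> &'static str {
    if thinking {
        // Open a <think> block and let the model generate its reasoning.
        "<|im_start|>assistant\n<think>\n"
    } else {
        // Emit an empty, closed <think> block so generation skips to the answer.
        "<|im_start|>assistant\n<think>\n\n</think>\n"
    }
}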

Usage Example

See examples/smollm3/main.rs for a unified example that supports both quantized and full-precision models from a single codebase.

bash
# Quantized model (recommended)
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"

# Full precision model
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Write a sorting algorithm"

# Enable thinking mode
cargo run --release --example smollm3 -- \
  --thinking \
  --prompt "Solve this logic puzzle step by step"

Performance Characteristics

| Model Type        | Size  | Speed  | Quality | Use Case             |
|-------------------|-------|--------|---------|----------------------|
| Q4_K_M            | 1.9GB | Fast   | Good    | Resource-constrained |
| Q8_0              | 3.3GB | Fast   | Better  | Balanced             |
| F16 (GGUF)        | 6.2GB | Medium | Best    | High-quality GGUF    |
| F16 (safetensors) | 6.2GB | Medium | Best    | Maximum quality      |
| F32 (safetensors) | 12GB  | Slow   | Best    | Research/debugging   |

Credits & Attribution

SmolLM3 Model

Developers

HuggingFace Team (HuggingFaceTB)

The SmolLM family of models represents cutting-edge work in efficient language models, demonstrating that small models can achieve impressive capabilities when trained on high-quality data.

Key Contributors

The SmolLM project is developed by the HuggingFace team with contributions from researchers focused on efficient LLM architectures and training methods.

NoPE Architecture

Research Paper

Title: "Length Generalization of Causal Transformers without Position Encoding"

Authors:

  • Jie Wang (Fudan University)
  • Tao Ji (Fudan University)
  • Yuanbin Wu (Fudan University)
  • Hang Yan (Fudan University)
  • Tao Gui (Fudan University)
  • Qi Zhang (Fudan University)
  • Xuanjing Huang (Fudan University)
  • Xiaoling Wang (Fudan University)

Published: NeurIPS 2024 (Thirty-Eighth Annual Conference on Neural Information Processing Systems)

Abstract Summary: The paper demonstrates that removing positional encoding from selected layers (NoPE - No Positional Encoding) can improve length generalization in causal transformers while maintaining or improving performance. SmolLM3 implements this with a 3:1 RoPE/NoPE ratio.

Key Innovation

The hybrid approach uses:

  • RoPE layers (75%): Standard rotary positional embeddings for local context
  • NoPE layers (25%): No positional encoding for improved length generalization
  • Pattern: Every 4th layer uses NoPE (layers 3, 7, 11, 15, etc.)

This architecture enables SmolLM3 to handle much longer contexts (64k-128k tokens) while maintaining efficiency.

Quantized Models

Unsloth

Quantized GGUF models are provided by Unsloth, a team focused on making LLM inference and fine-tuning more accessible.

The quantization work enables running SmolLM3 efficiently on consumer hardware with minimal quality loss.

Implementation Credits

This Candle Implementation

Implemented for: Candle ML Framework
Implementation Date: Nov 2025
Features:

  • Full precision model (F32/F16/BF16)
  • Quantized model (Q4_K_M/Q8_0/F16 GGUF)
  • Unified example supporting both
  • Verified against reference implementations

Verification:

  • Full precision: Validated against HuggingFace Transformers Python implementation
  • Quantized: Validated against llama.cpp implementation

  • Candle: Minimalist ML framework in Rust by HuggingFace
  • llama.cpp: Efficient LLM inference in C/C++
  • HuggingFace Transformers: Reference Python implementation

Acknowledgments

Special thanks to:

  1. HuggingFace Team - For developing SmolLM3 and making it openly available under Apache 2.0 license
  2. NoPE Researchers - For advancing the field with novel positional encoding approaches
  3. Unsloth - For providing optimized quantized versions
  4. Candle Contributors - For building an excellent ML framework in Rust
  5. Open Source Community - For tools like llama.cpp that enable verification and benchmarking

Citation

If you use SmolLM3 in your research or applications, please cite:

SmolLM3 Model

bibtex
@misc{smollm3,
  title={SmolLM3},
  author={HuggingFace Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/HuggingFaceTB/SmolLM3-3B}}
}

NoPE Paper

bibtex
@inproceedings{wang2024length,
  title={Length Generalization of Causal Transformers without Position Encoding},
  author={Wang, Jie and Ji, Tao and Wu, Yuanbin and Yan, Hang and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Wang, Xiaoling},
  booktitle={Thirty-Eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

Candle Framework

bibtex
@software{candle,
  title={Candle: Minimalist ML Framework},
  author={HuggingFace},
  year={2024},
  url={https://github.com/huggingface/candle}
}

License

  • SmolLM3 Model: Apache 2.0
  • This Implementation: Follows Candle framework license
  • Candle Framework: Apache 2.0 and MIT dual-licensed

This implementation stands on the shoulders of giants. Thank you to all the researchers, engineers, and open source contributors who make this work possible.