This directory contains implementations for the SmolLM family of models developed by HuggingFace.
SmolLM2 models (135M, 360M, 1.7B) use the standard Llama3 architecture and are implemented in `models/llama.rs`; no separate implementation is needed.
Variants:
- SmolLM3-3B introduces NoPE (No Positional Encoding), which requires a custom implementation in `smollm3.rs`.

Key innovations:
- NoPE: positional encoding is skipped in selected layers (see below)
- Explicit reasoning with `<think>` tags

Implementations:
- `smollm3.rs` - Full precision model (safetensors)
- `quantized_smollm3.rs` - Quantized GGUF model with weight reconstruction

Available Models:
- SmolLM3-3B, in full precision and quantized form (see the table below)
- Vision-language model variant, to be implemented
SmolLM3 uses a mixed approach to positional encoding:

```rust
pub fn should_skip_rope(&self, layer_idx: usize) -> bool {
    // Method 1: Explicit array from config
    if let Some(ref no_rope_layers) = self.no_rope_layers {
        if layer_idx < no_rope_layers.len() {
            return no_rope_layers[layer_idx] == 0;
        }
    }
    // Method 2: Interval pattern (SmolLM3-3B default)
    // Every 4th layer (indices 3, 7, 11, ...) skips RoPE
    if let Some(interval) = self.no_rope_layer_interval {
        return (layer_idx + 1) % interval == 0;
    }
    false // Default: use RoPE
}
```
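To make the interval rule concrete, here is a small self-contained sketch (the 36-layer count is an assumption for illustration, not a value taken from the config):

```rust
// Standalone illustration of the interval rule above: with
// no_rope_layer_interval = 4, layers 3, 7, 11, ... skip RoPE,
// giving the 3:1 RoPE/NoPE ratio described below.
fn main() {
    let interval = 4usize;
    let num_layers = 36usize; // assumed layer count, for illustration only
    let nope_layers: Vec<usize> = (0..num_layers)
        .filter(|layer_idx| (layer_idx + 1) % interval == 0)
        .collect();
    println!("Layers skipping RoPE: {nope_layers:?}"); // [3, 7, 11, ..., 35]
}
```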
The quantized implementation includes special handling for Q/K weight reconstruction to maintain compatibility with the GGUF format's interleaved weight storage.
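The exact reconstruction lives in `quantized_smollm3.rs`; as a rough, hypothetical sketch of the idea (the real permutation used by the GGUF export may differ), de-interleaving the rows of a Q/K projection back into contiguous per-head halves could look like this:

```rust
/// Hypothetical sketch only: regroup interleaved rotary rows (2i, 2i+1) of a
/// row-major (rows x cols) Q/K weight into per-head [first half | second half]
/// order. The actual layout handling is in quantized_smollm3.rs.
fn deinterleave_qk(weight: &[f32], rows: usize, cols: usize, n_head: usize) -> Vec<f32> {
    let head_dim = rows / n_head;
    let half = head_dim / 2;
    let mut out = vec![0.0f32; weight.len()];
    for h in 0..n_head {
        for i in 0..half {
            // Interleaved source rows 2i and 2i+1 map to split rows i and half+i.
            for (src_row, dst_row) in [(2 * i, i), (2 * i + 1, half + i)] {
                let src = (h * head_dim + src_row) * cols;
                let dst = (h * head_dim + dst_row) * cols;
                out[dst..dst + cols].copy_from_slice(&weight[src..src + cols]);
            }
        }
    }
    out
}
```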
SmolLM3 supports explicit reasoning with thinking tags:

- `<|im_start|>assistant\n<think>\n` (model generates reasoning)
- `<|im_start|>assistant\n<think>\n\n</think>\n` (skip to answer)

See `examples/smollm3/main.rs` for a unified implementation that supports both quantized and full precision models with a single codebase.
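For example, a minimal helper (hypothetical; the shipped example may build prompts differently) that picks between the two prefixes shown above:

```rust
// Hypothetical helper using the literal tag sequences above: an open <think>
// makes the model emit its reasoning first, while an already-closed, empty
// think block skips straight to the answer.
fn assistant_prefix(thinking: bool) -> &'static str {
    if thinking {
        "<|im_start|>assistant\n<think>\n"
    } else {
        "<|im_start|>assistant\n<think>\n\n</think>\n"
    }
}
```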
```bash
# Quantized model (recommended)
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"

# Full precision model
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Write a sorting algorithm"

# Enable thinking mode
cargo run --release --example smollm3 -- \
  --thinking \
  --prompt "Solve this logic puzzle step by step"
```
| Model Type | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| Q4_K_M | 1.9GB | Fast | Good | Resource-constrained |
| Q8_0 | 3.3GB | Fast | Better | Balanced |
| F16 (GGUF) | 6.2GB | Medium | Best | High quality GGUF |
| F16 (Safetensors) | 6.2GB | Medium | Best | Maximum quality |
| F32 (Safetensors) | 12GB | Slow | Best | Research/debugging |
The SmolLM project is developed by the HuggingFace team (HuggingFaceTB), with contributions from researchers focused on efficient LLM architectures and training methods. The SmolLM family represents cutting-edge work in efficient language models, demonstrating that small models can achieve impressive capabilities when trained on high-quality data.
Title: "Length Generalization of Causal Transformers without Position Encoding"
Authors: Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
Published: NeurIPS 2024 (Thirty-Eighth Annual Conference on Neural Information Processing Systems)
Abstract Summary: The paper demonstrates that removing positional encoding from selected layers (NoPE - No Positional Encoding) can improve length generalization in causal transformers while maintaining or improving performance. SmolLM3 implements this with a 3:1 RoPE/NoPE ratio.
Resources:
The hybrid approach uses:
- RoPE (rotary position embeddings) in three of every four layers
- NoPE in every fourth layer (indices 3, 7, 11, ...), i.e. a 3:1 RoPE/NoPE ratio
This architecture enables SmolLM3 to handle much longer contexts (64k-128k tokens) while maintaining efficiency.
Quantized GGUF models are provided by Unsloth, a team focused on making LLM inference and fine-tuning more accessible.
Resources:
The quantization work enables running SmolLM3 efficiently on consumer hardware with minimal quality loss.
Implemented for: Candle ML Framework
Implementation Date: Nov 2025
Features:
Verification:
- Candle: Minimalist ML framework in Rust by HuggingFace
- llama.cpp: Efficient LLM inference in C/C++
- HuggingFace Transformers: Reference Python implementation
Special thanks to:
If you use SmolLM3 in your research or applications, please cite:
```bibtex
@misc{smollm3,
  title={SmolLM3},
  author={HuggingFace Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/HuggingFaceTB/SmolLM3-3B}}
}

@inproceedings{wang2024length,
  title={Length Generalization of Causal Transformers without Position Encoding},
  author={Wang, Jie and Ji, Tao and Wu, Yuanbin and Yan, Hang and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Wang, Xiaoling},
  booktitle={Thirty-Eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

@software{candle,
  title={Candle: Minimalist ML Framework},
  author={HuggingFace},
  year={2024},
  url={https://github.com/huggingface/candle}
}
```
This implementation stands on the shoulders of giants. Thank you to all the researchers, engineers, and open source contributors who make this work possible.