candle-wasm-examples/quant-qwen3/README.md
A high-performance WebAssembly implementation of the Qwen3-0.6B language model running entirely in the browser. This project demonstrates efficient on-device inference using Rust, WASM, and the Candle ML framework with SIMD optimizations.
Running on a modern CPU with WASM SIMD support:
| Quantization | Speed | Model Size | Quality |
|---|---|---|---|
| Q8_0 (default) | 8.7 tok/s | ~645MB | Best |
| Q4_K_M | 5.8 tok/s | ~380MB | Good |
Q8_0 delivers both higher quality and higher throughput than Q4_K_M despite its larger size, making it the recommended choice.
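To put the throughput figures in perspective, the short sketch below converts them into approximate wall-clock time for a fixed-length response. The 256-token response length is an arbitrary assumption chosen for illustration, and the calculation ignores prompt processing and any slowdown from open DevTools.

```python
# Throughput figures copied from the table above (tokens per second).
THROUGHPUT_TOK_PER_S = {"Q8_0": 8.7, "Q4_K_M": 5.8}


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Return the seconds needed to emit num_tokens at a steady rate."""
    return num_tokens / tok_per_s


# 256 tokens is an assumed response length, not a value from this project.
for name, rate in THROUGHPUT_TOK_PER_S.items():
    print(f"{name}: {generation_seconds(256, rate):.1f} s for 256 tokens")
```

At these rates, a 256-token response takes roughly 29 seconds with Q8_0 and roughly 44 seconds with Q4_K_M.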
Performance Note: Having browser DevTools/console open can significantly reduce inference speed (up to 50% slower). For best performance, close the console during generation and only open it when you need to view profiling stats.
pip install huggingface-hub tqdm
cargo install wasm-pack
wasm-pack build --target web --release
./serve.py
The server will:
Navigate to http://localhost:8080 and start generating text!
# Use default Q8_0 model
./serve.py
# Use smaller Q4_K_M model (faster download, lower quality)
./serve.py --model 0.6b-q4
# Change port
./serve.py --port 3000
# Use custom GGUF model file
./serve.py --path /path/to/custom-model.gguf
./serve.py --help
Options:
--model, -m: Choose model variant (0.6b-q8 or 0.6b-q4)
--path, -p: Path to custom GGUF model file
--port: Server port (default: 8080)
--list-models: Show available models and exit
./serve.py --list-models
Output:
Available models:
0.6b-q8:
Size: ~645MB
Description: 8-bit quantization (best quality)
File: Qwen3-0.6B-Q8_0.gguf
0.6b-q4:
Size: ~380MB
Description: 4-bit quantization (smaller, faster)
File: Qwen3-0.6B-Q4_K_M.gguf
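For readers who want to adapt the command-line interface, the documented options could be wired up roughly as follows. This is a minimal sketch using `argparse`, not the actual contents of serve.py: the `MODELS` dictionary, `build_parser`, and `format_model_list` are hypothetical names, and only the flags, defaults, and model details are taken from the text above.

```python
import argparse

# Hypothetical model registry mirroring the --list-models output above.
# The real serve.py may organize this differently.
MODELS = {
    "0.6b-q8": {
        "size": "~645MB",
        "description": "8-bit quantization (best quality)",
        "file": "Qwen3-0.6B-Q8_0.gguf",
    },
    "0.6b-q4": {
        "size": "~380MB",
        "description": "4-bit quantization (smaller, faster)",
        "file": "Qwen3-0.6B-Q4_K_M.gguf",
    },
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four documented options."""
    parser = argparse.ArgumentParser(description="Development server (sketch)")
    parser.add_argument("--model", "-m", choices=list(MODELS), default="0.6b-q8",
                        help="Model variant to serve")
    parser.add_argument("--path", "-p", default=None,
                        help="Path to custom GGUF model file")
    parser.add_argument("--port", type=int, default=8080, help="Server port")
    parser.add_argument("--list-models", action="store_true",
                        help="Show available models and exit")
    return parser


def format_model_list() -> str:
    """Render the registry in the same shape as the sample output."""
    lines = ["Available models:"]
    for name, info in MODELS.items():
        lines.append(f"{name}:")
        lines.append(f"Size: {info['size']}")
        lines.append(f"Description: {info['description']}")
        lines.append(f"File: {info['file']}")
    return "\n".join(lines)
```

Calling `build_parser().parse_args(["--model", "0.6b-q4", "--port", "3000"])` yields a namespace with `model` set to `"0.6b-q4"` and `port` set to `3000`. With no arguments, the defaults are the Q8_0 model and port 8080.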
.
├── src/
│ ├── lib.rs # WASM bindings
│ ├── m.rs # Model implementation
│ └── profiler.rs # Performance profiler
├── index.html # Web interface
├── serve.py # Development server with auto-download
├── Cargo.toml # Rust dependencies
├── .cargo/
│ └── config.toml # WASM build config (SIMD flags)
└── pkg/ # Generated WASM (after build)
The interface includes several tools for monitoring and debugging performance:
Prints detailed performance profiling data to the browser console, including:
When to use: After generation to analyze which operations are bottlenecks
Resets all accumulated profiling data to start fresh measurements.
When to use: Before running a benchmark or when you want to measure a specific generation without previous data
Refreshes the memory display showing:
When to use: To check current memory consumption, especially useful for:
Example workflow:
The project uses WebAssembly SIMD128 instructions for accelerated matrix operations. The SIMD feature is enabled in .cargo/config.toml:
[target.wasm32-unknown-unknown]
rustflags = [
'-C', 'target-feature=+simd128',
]
Models use GGUF format with different quantization schemes:
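As a rough illustration of what 8-bit block quantization means, the sketch below quantizes one block of 32 floats to signed 8-bit integers with a single shared scale, which is the general idea behind Q8_0. It is a simplified pure-Python model rather than this project's Rust code: it keeps the scale as a Python float (the on-disk format stores a 16-bit float), uses Python's `round`, and does not cover the more involved Q4_K_M layout. The function names are hypothetical.

```python
BLOCK_SIZE = 32  # number of weights that share one scale in this sketch


def quantize_block(values):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        # An all-zero block needs no scaling.
        return 0.0, [0] * BLOCK_SIZE
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a scale and its quantized values."""
    return [scale * q for q in quants]


# Example block: 32 evenly spaced values in [-0.5, 0.5].
block = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max reconstruction error={max_error:.6f}")
```

Each reconstructed value differs from the original by at most half the scale, which is why 8-bit quantization loses little precision compared with 4-bit schemes.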
wasm-pack build --target web --dev
Open the browser console after generation to see a detailed timing breakdown:
// In browser console
showProfile() // Print performance stats
clearProfile() // Reset profiler
updateMemory() // Check memory usage
This implementation is provided as-is. Please refer to the original Qwen3 license for model usage terms.
Built using Rust, WebAssembly, and the Candle framework