rust/README.md
This is a Rust drop-in alternative frontend for vLLM. The current goal is to rebuild the northbound serving layer in Rust while still talking to the core Python vLLM engine process(es) via ZMQ over the existing engine boundary.
It is still experimental and not feature-complete. We are working to add more functionality from the Python frontend.
See https://github.com/Inferact/vllm-frontend-rs for the original commit history before it was moved into the main vllm repo.
The component is organized as a Cargo workspace with several crates, layered bottom-up:
┌─────────────────────────────────┐
│ vllm-cmd / vllm-rs │ CLI entrypoint:
│ │ Python vLLM frontend subprocess
│ │ Rust managed-engine serve mode
├─────────────────────────────────┤
│ vllm-server │ OpenAI-compatible HTTP API (axum)
├─────────────────────────────────┤
│ vllm-chat │ Chat completions: template rendering,
│ │ structured assistant events,
│ │ reasoning & tool parsing
├─────────────────────────────────┤
│ vllm-text │ Tokenizer & incremental detokenizer
├─────────────────────────────────┤
│ vllm-llm │ Thin token-in/token-out facade over
│ │ the engine client
├─────────────────────────────────┤
│ vllm-engine-core-client │ ZMQ transport + MessagePack protocol
│ │ for the headless vLLM engine
└─────────────────────────────────┘
vllm-rs integrates into Python vLLM as a Rust frontend subprocess.
Python owns process startup and launches the Rust API server as a Python-supervised worker, while
passing the inherited listening socket and transport addresses into vllm-rs.
For example:
VLLM_USE_RUST_FRONTEND=1 vllm serve Qwen/Qwen3-0.6B
vllm-rs serve can be run standalone with --data-parallel-size-local 0 when the Python engines
are started elsewhere and this node should run only the Rust frontend. The frontend still uses
the global --data-parallel-size to determine how many engines it expects to join the shared handshake.
First, start the headless Python engine:
vllm serve Qwen/Qwen3-0.6B \
--headless \
--data-parallel-address 127.0.0.1 \
--data-parallel-rpc-port 62100 \
--data-parallel-size 1 \
--data-parallel-size-local 1
Then start the Rust frontend-only server:
vllm-rs serve Qwen/Qwen3-0.6B \
--data-parallel-address 127.0.0.1 \
--data-parallel-rpc-port 62100 \
--data-parallel-size 1 \
--data-parallel-size-local 0
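The two commands above must agree on the data-parallel address, RPC port, and global size, and differ only in --headless and --data-parallel-size-local. As a rough sketch of that relationship, the following builds both argument lists from one shared config. The `DataParallelConfig` class and the helper function names are hypothetical and exist only for illustration; only the command-line flags come from this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataParallelConfig:
    """Hypothetical container for the settings both processes must share."""
    model: str
    address: str
    rpc_port: int
    size: int  # global --data-parallel-size


def shared_flags(cfg: DataParallelConfig) -> List[str]:
    # Flags that must be identical for the engine and the frontend.
    return [
        "--data-parallel-address", cfg.address,
        "--data-parallel-rpc-port", str(cfg.rpc_port),
        "--data-parallel-size", str(cfg.size),
    ]


def engine_argv(cfg: DataParallelConfig, local_engines: int) -> List[str]:
    # Headless Python engine process: runs `local_engines` engines on this node.
    return [
        "vllm", "serve", cfg.model, "--headless",
        *shared_flags(cfg),
        "--data-parallel-size-local", str(local_engines),
    ]


def frontend_argv(cfg: DataParallelConfig) -> List[str]:
    # Rust frontend-only process: zero local engines on this node.
    return [
        "vllm-rs", "serve", cfg.model,
        *shared_flags(cfg),
        "--data-parallel-size-local", "0",
    ]


if __name__ == "__main__":
    cfg = DataParallelConfig("Qwen/Qwen3-0.6B", "127.0.0.1", 62100, 1)
    print(" ".join(engine_argv(cfg, local_engines=1)))
    print(" ".join(frontend_argv(cfg)))
```

Each list can be passed to `subprocess.Popen` if you want to launch both processes from a script. Building them from a single config avoids a mismatched port or size between the two sides.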
To build vllm-rs in isolation:
# from the local checkout
cargo install --path src/cmd --bin vllm-rs
After either startup path, you can use any OpenAI-compatible client:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": true
}'
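The same streaming request can be made from Python using only the standard library. This is a minimal sketch: it assumes the server emits OpenAI-style server-sent events (lines of the form `data: {...}`, terminated by `data: [DONE]`) and that each chunk carries its text under `choices[].delta.content`. The function names `iter_stream_text` and `stream_chat` are illustrative and are not part of vllm-rs.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Some chunks (e.g. the first, role-only one) carry no content.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(base_url: str, model: str, prompt: str) -> Iterator[str]:
    """POST a streaming chat completion and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response object iterates line by line, yielding bytes.
        yield from iter_stream_text(line.decode("utf-8") for line in resp)


if __name__ == "__main__":
    for piece in stream_chat("http://127.0.0.1:8000", "Qwen/Qwen3-0.6B",
                             "What is the capital of France?"):
        print(piece, end="", flush=True)
    print()
```

Keeping the line parser separate from the HTTP call lets it be exercised against recorded output without a running server.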