# mlx-server
A high-performance inference server for MLX models, providing an OpenAI-compatible API for running large language models on Apple Silicon.
```bash
# Clone the repository
cd mlx-server

# Build in release mode
xcodebuild -scheme mlx-server -configuration Release
# The binary and metallib will be in the Xcode derived data build products
```

```bash
# Run with a local MLX model
./.build/arm64-apple-macosx/release/mlx-server \
  --model "/path/to/your/model" \
  --port 8080
```
| Option | Default | Description |
|---|---|---|
| `-m, --model` | Required | Path to model directory or HuggingFace model ID |
| `--port` | 8080 | HTTP server port |
| `--ctx-size` | 4096 | Context window size |
| `--api-key` | `""` | API key for authentication (optional) |
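Putting the options together, a typical invocation with a larger context window and an API key might look like the following sketch (the model path and key are placeholders):

```bash
# Placeholder model path and API key; adjust for your setup
./.build/arm64-apple-macosx/release/mlx-server \
  --model "/path/to/your/model" \
  --port 8080 \
  --ctx-size 8192 \
  --api-key "my-secret-key"
```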
Send a chat completion request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
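If the server was started with `--api-key`, OpenAI-compatible clients typically send the key as a bearer token. Assuming this server follows that convention, an authenticated request would look roughly like this (the key is a placeholder):

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret-key" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'
```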
Request a streaming response:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "stream": true
  }'
```
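Assuming the stream follows the usual OpenAI server-sent events format (one `data:` line per chunk, terminated by `data: [DONE]`), the generated text can be extracted on the command line with `jq` (not installed by default on macOS):

```bash
# Print only the text deltas from the streamed chunks
curl -sN -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "stream": true
  }' \
| while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;                                   # end-of-stream sentinel
      data:*) printf '%s\n' "${line#data: }" \
                | jq -j '.choices[0].delta.content // empty' ;;  # text delta, if any
    esac
  done
echo
```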
Cancel the current generation:

```bash
curl -X POST http://localhost:8080/v1/cancel
```

List available models:

```bash
curl http://localhost:8080/v1/models
```

Check server health:

```bash
curl http://localhost:8080/health
```
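Because model loading can take a while, it can be useful to wait for the health endpoint before sending requests. A minimal readiness loop, assuming `/health` returns a 2xx status once the server is up:

```bash
# Poll /health until it responds successfully (-f makes curl fail on non-2xx)
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 1
done
echo "mlx-server is ready"
```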
```
mlx-server/
├── Sources/
│   └── MLXServer/
│       ├── MLXServerCommand.swift   # CLI entry point
│       ├── ModelRunner.swift        # Core inference engine
│       ├── Server.swift             # HTTP server & API handlers
│       ├── OpenAITypes.swift        # API type definitions
│       └── Logger.swift             # Logging utilities
├── Package.swift                    # Swift package manifest
└── README.md                        # This file
```
The server binds to 127.0.0.1 (localhost only) for security.

Ensure the model directory contains:

- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `model.safetensors` or `model.safetensors.index.json` - Model weights
- `generation_config.json`, `chat_template.jinja`
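A quick sanity check for a model directory, based on the list above (the path is a placeholder):

```bash
MODEL_DIR="/path/to/your/model"
# Required files per the list above
for f in config.json tokenizer.json; do
  [ -f "$MODEL_DIR/$f" ] || echo "missing: $f"
done
# Weights can be a single file or a sharded index
[ -f "$MODEL_DIR/model.safetensors" ] || [ -f "$MODEL_DIR/model.safetensors.index.json" ] \
  || echo "missing: model.safetensors or model.safetensors.index.json"
```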
To run on a different port:

```bash
--port 8081
```
Use the mlx-lm server benchmark script to measure throughput and latency:
```bash
python server_benchmark.py --url http://localhost:8080/v1/chat/completions --model model
```
This project is part of Jan, an open-source desktop AI application.