--8<-- [start:installation]
For GPU-accelerated inference on Apple Silicon, use vLLM-Metal, a community-maintained hardware plugin that uses MLX as the compute backend and provides native GPU acceleration via Apple's Metal framework.
vLLM-Metal works with MLX-optimized models from the mlx-community organization on Hugging Face, which provides quantized versions of popular models optimized for Apple Silicon.
!!! tip
    For installation and usage instructions, see the Set up using vLLM-Metal section below.

--8<-- [end:installation]
--8<-- [start:requirements]

!!! note
    See the Set up using vLLM-Metal section below for installation instructions.

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
vLLM-Metal is distributed as a separate package that provides native GPU acceleration on Apple Silicon.
To install vLLM-Metal, follow the installation instructions in the vLLM-Metal documentation.
The installation will:
After installation, you can start using vLLM with Metal GPU acceleration.
!!! tip
    When using vLLM-Metal, use models from the mlx-community organization on Hugging Face for best performance. These models are optimized for MLX and often include quantized versions (4-bit, 8-bit) that run efficiently on Apple Silicon.

    Example model: `mlx-community/Qwen2.5-0.5B-Instruct-4bit`
After installation, vLLM-Metal provides an easy-to-use CLI for running an OpenAI-compatible API server:
```bash
# Activate the vLLM-Metal environment
source ~/.venv-vllm-metal/bin/activate

# Start the API server (pass an mlx-community model, or omit it to use the default)
vllm serve
```
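The server needs some time to load the model before it accepts requests, so a script that talks to it may want to wait until it responds. The following is a minimal sketch, not part of vLLM-Metal: `server_is_ready` and `wait_for_server` are hypothetical helper names, and the sketch assumes the server listens on `http://localhost:8000` and exposes the OpenAI-compatible `/v1/models` route.

```python
import http.client
import time
import urllib.error
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    Assumption: the server exposes the OpenAI-compatible /v1/models route.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as "not ready yet".
        return False


def wait_for_server(check=server_is_ready, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_server()` with the defaults polls for up to about a minute of sleep time and returns `False` if the server never answers. The `check` parameter is injectable so the retry loop can be exercised without a running server.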
Once the server is running, you have multiple options to interact with it:
Open a new terminal and start an interactive chat session:

```bash
source ~/.venv-vllm-metal/bin/activate
vllm chat
```
Alternatively, send a request to the chat completions endpoint with curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
Or use the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # No auth required for local server
)

response = client.chat.completions.create(
    # Must match the model the server is actually serving
    model="mlx-community/Qwen2.5-0.5B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
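If the server was started without an explicit model, the model name passed by the client may not match what is being served. One way to avoid the mismatch is to ask the server which models it exposes. The sketch below is illustrative only: `served_model_ids` and `fetch_served_model_ids` are hypothetical helper names, and it assumes the server returns the standard OpenAI-style list response from `/v1/models` (an object with a `data` array whose entries carry an `id`).

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model IDs from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_served_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query the running server and return the IDs of the models it serves.

    Assumption: the server is reachable at `base_url` without authentication.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))
```

With the server running, the first entry of `fetch_served_model_ids()` can be passed as the `model` argument to `client.chat.completions.create`.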
For more details on the `vllm` CLI commands, see the OpenAI-compatible server documentation.
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
vLLM-Metal is distributed as its own package. See the Set up using vLLM-Metal section above.
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
For build instructions from source, refer to the vLLM-Metal documentation.
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
--8<-- [end:build-image-from-source]
--8<-- [start:supported-features]
vLLM-Metal provides:
For specific feature support and limitations, refer to the vLLM-Metal documentation.
--8<-- [end:supported-features]