docs/deployment/frameworks/runpod.md
vLLM can be deployed on RunPod, a cloud GPU platform that provides on-demand and serverless GPU instances for AI inference workloads.
runpod/pytorch)SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model <model-name> \
--host 0.0.0.0 \
--port 8000
!!! note
Use `--host 0.0.0.0` to bind to all interfaces so the server is reachable from outside the container.
RunPod exposes HTTP services through its proxy. To make port 8000 accessible:
In the RunPod dashboard, navigate to your pod settings.
Add 8000 to the list of exposed HTTP ports.
After the pod restarts, RunPod provides a public URL in the format:
https://<pod-id>-8000.proxy.runpod.net
A 502 Bad Gateway error from the RunPod proxy typically means the server is not yet listening. Common causes:
--host 0.0.0.0. Binding to 127.0.0.1 (the default) makes the server unreachable from the proxy.--port value matches the port exposed in the RunPod dashboard.--tensor-parallel-size for multi-GPU pods.Once the server is running, test it with a curl request:
!!! console "Command"
```bash
curl https://<pod-id>-8000.proxy.runpod.net/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 50
}'
```
!!! console "Response"
```json
{
"id": "chat-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "I'm doing well, thank you for asking! How can I help you today?"
},
"index": 0,
"finish_reason": "stop"
}
]
}
```
You can also check the server health endpoint:
curl https://<pod-id>-8000.proxy.runpod.net/health