docs/deployment/frameworks/lws.md
LeaderWorkerSet (LWS) is a Kubernetes API that addresses common deployment patterns of AI/ML inference workloads. A major use case is multi-host/multi-node distributed inference.
vLLM can be deployed with LWS on Kubernetes for distributed model serving.
Save one of the following manifests as `lws.yaml` (examples are provided for both the multiprocessing and Ray backends), then deploy it:
??? code "lws.yaml"
=== "Multiprocessing (default)"
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HF_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size $(LWS_GROUP_SIZE) --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --master-addr $(LWS_LEADER_ADDRESS) --port 8080"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size $(LWS_GROUP_SIZE) --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --master-addr $(LWS_LEADER_ADDRESS) --headless"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HF_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
=== "Ray"
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HF_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2 --distributed-executor-backend ray"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HF_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
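Both manifests rely on the parallelism flags matching the LWS topology: `size: 2` pods with 8 GPUs each provide 16 GPUs, which equals `--tensor-parallel-size 8` multiplied by a pipeline-parallel size of 2. The following is a minimal sketch for checking that arithmetic before changing any of these values, assuming no other parallelism dimension is in use. `check_parallelism` is a hypothetical helper, not part of vLLM or LWS.

```shell
# Hypothetical helper (not part of vLLM or LWS): checks that
# tensor-parallel size x pipeline-parallel size equals the number of GPUs
# the LeaderWorkerSet group provides (pods per group x GPUs per pod).
check_parallelism() {
  local tp="$1" pp="$2" group_size="$3" gpus_per_pod="$4"
  local needed=$(( tp * pp ))
  local available=$(( group_size * gpus_per_pod ))
  if [ "$needed" -eq "$available" ]; then
    echo "ok: ${needed} GPUs"
  else
    echo "mismatch: flags need ${needed} GPUs, group provides ${available}"
    return 1
  fi
}

# Values from the manifests above: TP=8, PP=2, size=2, 8 GPUs per pod.
check_parallelism 8 2 2 8
```

If you change `size` or the `nvidia.com/gpu` limit, rerun the check with the new numbers and adjust the `vllm serve` flags so the two sides agree.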
```bash
kubectl apply -f lws.yaml
```
Verify the status of the pods:
```bash
kubectl get pods
```
The output should be similar to the following:
```text
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
```
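Rather than polling `kubectl get pods` by hand, you can block until the pods report Ready. This is a minimal sketch, assuming the pods carry the `leaderworkerset.sigs.k8s.io/name: vllm` label that the Service selector in `lws.yaml` relies on. The function name `wait_for_vllm_pods` and the 30-minute default timeout are illustrative choices, not values from the LWS or vLLM documentation.

```shell
# Hypothetical helper: wait until every pod of the "vllm" LeaderWorkerSet is Ready.
# The label key is the one used by the Service selector in lws.yaml.
wait_for_vllm_pods() {
  # Assumed default; loading a large model can take a long time.
  local timeout="${1:-30m}"
  kubectl wait pod \
    --for=condition=Ready \
    --selector=leaderworkerset.sigs.k8s.io/name=vllm \
    --timeout="${timeout}"
}
```

Call it as `wait_for_vllm_pods 45m` after `kubectl apply`. Because the leader's readiness probe is a TCP check on port 8080, a Ready leader means the server port is open.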
Verify that distributed inference (tensor parallelism within each pod, pipeline parallelism across pods) works by checking the model-loading logs:
=== "Multiprocessing (default)"
```bash
kubectl logs vllm-0 | grep -i "Model loading"
kubectl logs vllm-0-1 | grep -i "Model loading"
```
The output should be similar to the following:
POD 0 (PP Rank 0)
```text
(Worker_PP0_TP0 pid=601) INFO 04-28 08:16:58 [gpu_model_runner.py:4820] Model loading took 3.82 GiB memory and 157.996399 seconds
```
POD 1 (PP Rank 1)
```text
(Worker_PP1_TP0 pid=396) INFO 04-28 08:17:09 [gpu_model_runner.py:4820] Model loading took 3.82 GiB memory and 168.878781 seconds
```
=== "Ray"
```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```
The output should be similar to the following:
```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
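To reach the `vllm-leader` ClusterIP Service from your machine, forward a local port to it:
```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```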
The output should be similar to the following:
```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
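Before sending a completion request, it can help to confirm that the server answers through the port-forward. The sketch below polls the OpenAI-compatible `/v1/models` endpoint until it responds. The function name `wait_for_server`, the retry count, and the delay are illustrative assumptions.

```shell
# Hypothetical helper: poll the model-listing endpoint until it responds
# or the retry budget is exhausted.
wait_for_server() {
  local base_url="${1:-http://localhost:8080}"
  local retries="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$retries" ]; do
    # -s: no progress output, -f: treat HTTP error statuses as failure
    if curl -sf "${base_url}/v1/models" >/dev/null; then
      echo "server ready after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
    i=$(( i + 1 ))
  done
  echo "server not reachable at ${base_url}" >&2
  return 1
}
```

Run `wait_for_server` in the second terminal while the port-forward is active. A non-zero exit status means the endpoint never answered within the retry budget.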
Open another terminal and send a request:
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
The output should be similar to the following:
??? console "Output"
```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```