LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference.

vLLM can be deployed with LWS on Kubernetes for distributed model serving.

Prerequisites

At least two Kubernetes nodes, each with 8 GPUs, are required.
Install LWS by following the instructions found here.

Deploy and Serve

Deploy the following yaml file lws.yaml (we have examples that use multiprocessing or Ray):

??? code "lws.yaml" === "Multiprocessing (default)" yaml apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: name: vllm spec: replicas: 1 leaderWorkerTemplate: size: 2 restartPolicy: RecreateGroupOnPodRestart leaderTemplate: metadata: labels: role: leader spec: containers: - name: vllm-leader image: docker.io/vllm/vllm-openai:latest env: - name: HF_TOKEN value: <your-hf-token> command: - sh - -c - "vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size $(LWS_GROUP_SIZE) --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --master-addr $(LWS_LEADER_ADDRESS) --port 8080" resources: limits: nvidia.com/gpu: "8" memory: 1124Gi ephemeral-storage: 800Gi requests: ephemeral-storage: 800Gi cpu: 125 ports: - containerPort: 8080 readinessProbe: tcpSocket: port: 8080 initialDelaySeconds: 15 periodSeconds: 10 volumeMounts: - mountPath: /dev/shm name: dshm volumes: - name: dshm emptyDir: medium: Memory sizeLimit: 15Gi workerTemplate: spec: containers: - name: vllm-worker image: docker.io/vllm/vllm-openai:latest command: - sh - -c - "vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size $(LWS_GROUP_SIZE) --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --master-addr $(LWS_LEADER_ADDRESS) --headless" resources: limits: nvidia.com/gpu: "8" memory: 1124Gi ephemeral-storage: 800Gi requests: ephemeral-storage: 800Gi cpu: 125 env: - name: HF_TOKEN value: <your-hf-token> volumeMounts: - mountPath: /dev/shm name: dshm volumes: - name: dshm emptyDir: medium: Memory sizeLimit: 15Gi --- apiVersion: v1 kind: Service metadata: name: vllm-leader spec: ports: - name: http port: 8080 protocol: TCP targetPort: 8080 selector: leaderworkerset.sigs.k8s.io/name: vllm role: leader type: ClusterIP

=== "Ray"
    ```yaml
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: vllm
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels:
              role: leader
          spec:
            containers:
              - name: vllm-leader
                image: docker.io/vllm/vllm-openai:latest
                env:
                  - name: HF_TOKEN
                    value: <your-hf-token>
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                    vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2 --distributed-executor-backend ray"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    memory: 1124Gi
                    ephemeral-storage: 800Gi
                  requests:
                    ephemeral-storage: 800Gi
                    cpu: 125
                ports:
                  - containerPort: 8080
                readinessProbe:
                  tcpSocket:
                    port: 8080
                  initialDelaySeconds: 15
                  periodSeconds: 10
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 15Gi
        workerTemplate:
          spec:
            containers:
              - name: vllm-worker
                image: docker.io/vllm/vllm-openai:latest
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    memory: 1124Gi
                    ephemeral-storage: 800Gi
                  requests:
                    ephemeral-storage: 800Gi
                    cpu: 125
                env:
                  - name: HF_TOKEN
                    value: <your-hf-token>
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 15Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-leader
    spec:
      ports:
        - name: http
          port: 8080
          protocol: TCP
          targetPort: 8080
      selector:
        leaderworkerset.sigs.k8s.io/name: vllm
        role: leader
      type: ClusterIP
    ```

bash

kubectl apply -f lws.yaml

Verify the status of the pods:

bash

kubectl get pods

Should get an output similar to this:

bash

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s

Verify that the distributed tensor-parallel inference works:

=== "Multiprocessing (default)" bash kubectl logs vllm-0 | grep -i "Model loading" kubectl logs vllm-0-1 | grep -i "Model loading"

Should get something similar to this:

POD 0 (PP Rank 0)

```text
(Worker_PP0_TP0 pid=601) INFO 04-28 08:16:58 [gpu_model_runner.py:4820] Model loading took 3.82 GiB memory and 157.996399 seconds
```

POD 1 (PP Rank 1)

```text
(Worker_PP1_TP0 pid=396) INFO 04-28 08:17:09 [gpu_model_runner.py:4820] Model loading took 3.82 GiB memory and 168.878781 seconds
```

=== "Ray" bash kubectl logs vllm-0 | grep -i "Loading model weights took"

Should get something similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```

Access ClusterIP service

bash

# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080

The output should be similar to the following:

text

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Serve the model

Open another terminal and send a request

text

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should be similar to the following

??? console "Output" text { "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d", "object": "text_completion", "created": 1715138766, "model": "meta-llama/Meta-Llama-3.1-405B-Instruct", "choices": [ { "index": 0, "text": " top destination for foodies, with", "logprobs": null, "finish_reason": "length", "stop_reason": null } ], "usage": { "prompt_tokens": 5, "total_tokens": 12, "completion_tokens": 7 } }