docs/deployment/integrations/kthena.md
Kthena is a Kubernetes-native LLM inference platform for deploying and managing Large Language Models in production. It combines declarative model lifecycle management with intelligent request routing for LLM inference workloads.
This guide shows how to deploy a production-grade, multi-node vLLM service on Kubernetes.
We’ll:

- Deploy a multi-node vLLM model with a `ModelServing` CR.

You need:

- `kubectl` access with cluster-admin or equivalent permissions.
- Kthena installed, with the `ModelServing` CRD available.

Install Volcano:

```bash
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```
This provides the gang-scheduling and network topology features used by Kthena.
Install Kthena:

```bash
helm install kthena oci://ghcr.io/volcano-sh/charts/kthena --version v0.1.0 --namespace kthena-system --create-namespace
```

Check that:

- The `kthena-system` namespace is created.
- The Kthena CRDs, including `ModelServing`, are installed and healthy.

Validate:

```bash
kubectl get crd | grep modelserving
```

You should see:

```text
modelservings.workload.serving.volcano.sh ...
```
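For scripted installs, the same check can be wrapped in a small helper. This is only a sketch: `has_modelserving_crd` is a hypothetical name, and it assumes the CRD list is produced by `kubectl get crd -o name`, which prints one `customresourcedefinition.apiextensions.k8s.io/<name>` entry per line.

```shell
# Hypothetical helper: reads CRD names on stdin and succeeds only if the
# ModelServing CRD is among them.
has_modelserving_crd() {
  grep -q 'modelservings\.workload\.serving\.volcano\.sh'
}

# Usage against a live cluster (not run here):
#   kubectl get crd -o name | has_modelserving_crd && echo "ModelServing CRD installed"
```

The helper exits non-zero when the CRD is missing, so it can gate later steps in a deployment script.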
ModelServing Example

Kthena provides an example manifest to deploy a multi-node vLLM cluster running Llama. Conceptually this is equivalent to the vLLM production stack Helm deployment, but expressed with `ModelServing`.

A simplified version of the example (`llama-multinode`) looks like:

- `spec.replicas: 1`: one `ServingGroup` (one logical model deployment).
- `roles`:
    - `entryTemplate`: defines leader pods that run the vLLM OpenAI-compatible API server (preceded by the Ray cluster bootstrap script when the Ray backend is used).
    - `workerTemplate`: defines worker pods that join the leader’s Ray cluster (Ray backend) or the same distributed process group (multiprocessing backend).

Key points from the example YAML:

- Image: `vllm/vllm-openai:latest` (matches upstream vLLM images).
- Commands:
??? code "Yaml"

    === "Multiprocessing (default)"

        Leader:

        ```yaml
        command:
          - sh
          - -c
          - >
            vllm serve meta-llama/Llama-3.1-405B-Instruct
            --tensor-parallel-size 8
            --pipeline-parallel-size 2
            --nnodes=2
            --node-rank=0
            --master-addr=$(ENTRY_ADDRESS)
            --port 8080
        ```

        Worker:

        ```yaml
        command:
          - sh
          - -c
          - >
            vllm serve meta-llama/Llama-3.1-405B-Instruct
            --tensor-parallel-size 8
            --pipeline-parallel-size 2
            --nnodes=2
            --node-rank=1
            --master-addr=$(ENTRY_ADDRESS)
            --headless
        ```

    === "Ray"

        Leader:

        ```yaml
        command:
          - sh
          - -c
          - >
            bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh
            leader --ray_cluster_size=2; python3 -m
            vllm.entrypoints.openai.api_server --port 8080 --model
            meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8
            --pipeline-parallel-size 2
        ```

        Worker:

        ```yaml
        command:
          - sh
          - -c
          - >
            bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh
            worker --ray_address=$(ENTRY_ADDRESS)
        ```
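A quick way to sanity-check these flags is to compare the parallelism settings with the GPUs the pods request. The sketch below assumes that tensor and pipeline parallelism are the only parallel dimensions in use, so the number of GPU workers is their product; `check_parallelism` is a hypothetical helper, and the 8 GPUs per node come from the `nvidia.com/gpu: "8"` limit in the example manifest.

```shell
# Usage: check_parallelism <tensor-parallel> <pipeline-parallel> <nnodes> <gpus-per-node>
# Succeeds when tensor-parallel * pipeline-parallel equals the GPUs available
# across all nodes.
check_parallelism() (
  tp=$1; pp=$2; nnodes=$3; gpus_per_node=$4
  world_size=$((tp * pp))
  available=$((nnodes * gpus_per_node))
  if [ "$world_size" -eq "$available" ]; then
    echo "ok: $world_size GPUs"
  else
    echo "mismatch: need $world_size GPUs, have $available" >&2
    exit 1
  fi
)

# Values from the example: 8 x 2 workers over 2 nodes with 8 GPUs each.
check_parallelism 8 2 2 8   # prints "ok: 16 GPUs"
```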
Recommended: use a Secret instead of a raw env var:
```bash
kubectl create secret generic hf-token \
  -n default \
  --from-literal=HUGGING_FACE_HUB_TOKEN='<your-token>'
```
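If you prefer a declarative workflow, the same Secret can be rendered as a manifest. This is a sketch under assumptions: `render_hf_secret` is a hypothetical helper, and it relies on the standard `stringData` field of a Kubernetes `Secret`, which accepts the value in plain text and leaves base64 encoding to the API server.

```shell
# Hypothetical helper: print a Secret manifest named hf-token.
# Usage: render_hf_secret <namespace> <token>
render_hf_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: $1
type: Opaque
stringData:
  HUGGING_FACE_HUB_TOKEN: "$2"
EOF
}

# Usage against a live cluster (not run here):
#   render_hf_secret default "$HF_TOKEN" | kubectl apply -f -
```

Piping the output straight to `kubectl apply` keeps the token out of files on disk.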
Save one of the following `ModelServing` manifests to `modelserving.yaml`:
??? code "modelserving.yaml"

    === "Multiprocessing (default)"

        ```yaml
        apiVersion: workload.serving.volcano.sh/v1alpha1
        kind: ModelServing
        metadata:
          name: llama-multinode
          namespace: default
        spec:
          schedulerName: volcano
          replicas: 1  # group replicas
          template:
            restartGracePeriodSeconds: 60
            gangPolicy:
              minRoleReplicas:
                405b: 1
            roles:
              - name: 405b
                replicas: 2
                entryTemplate:
                  spec:
                    containers:
                      - name: leader
                        image: vllm/vllm-openai:latest
                        env:
                          - name: HUGGING_FACE_HUB_TOKEN
                            valueFrom:
                              secretKeyRef:
                                name: hf-token
                                key: HUGGING_FACE_HUB_TOKEN
                        command:
                          - sh
                          - -c
                          - "vllm serve meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 0 --master-addr $(ENTRY_ADDRESS) --distributed-executor-backend mp --port 8080"
                        resources:
                          limits:
                            nvidia.com/gpu: "8"
                            memory: 1124Gi
                            ephemeral-storage: 800Gi
                          requests:
                            ephemeral-storage: 800Gi
                            cpu: 125
                        ports:
                          - containerPort: 8080
                        readinessProbe:
                          tcpSocket:
                            port: 8080
                          initialDelaySeconds: 15
                          periodSeconds: 10
                        volumeMounts:
                          - mountPath: /dev/shm
                            name: dshm
                    volumes:
                      - name: dshm
                        emptyDir:
                          medium: Memory
                          sizeLimit: 15Gi
                workerReplicas: 1
                workerTemplate:
                  spec:
                    containers:
                      - name: worker
                        image: vllm/vllm-openai:latest
                        command:
                          - sh
                          - -c
                          - "vllm serve meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2 --node-rank 1 --master-addr $(ENTRY_ADDRESS) --distributed-executor-backend mp --headless"
                        resources:
                          limits:
                            nvidia.com/gpu: "8"
                            memory: 1124Gi
                            ephemeral-storage: 800Gi
                          requests:
                            ephemeral-storage: 800Gi
                            cpu: 125
                        env:
                          - name: HUGGING_FACE_HUB_TOKEN
                            valueFrom:
                              secretKeyRef:
                                name: hf-token
                                key: HUGGING_FACE_HUB_TOKEN
                        volumeMounts:
                          - mountPath: /dev/shm
                            name: dshm
                    volumes:
                      - name: dshm
                        emptyDir:
                          medium: Memory
                          sizeLimit: 15Gi
        ```

    === "Ray"

        ```yaml
        apiVersion: workload.serving.volcano.sh/v1alpha1
        kind: ModelServing
        metadata:
          name: llama-multinode
          namespace: default
        spec:
          schedulerName: volcano
          replicas: 1  # group replicas
          template:
            restartGracePeriodSeconds: 60
            gangPolicy:
              minRoleReplicas:
                405b: 1
            roles:
              - name: 405b
                replicas: 2
                entryTemplate:
                  spec:
                    containers:
                      - name: leader
                        image: vllm/vllm-openai:latest
                        env:
                          - name: HUGGING_FACE_HUB_TOKEN
                            valueFrom:
                              secretKeyRef:
                                name: hf-token
                                key: HUGGING_FACE_HUB_TOKEN
                        command:
                          - sh
                          - -c
                          - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
                            vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2"
                        resources:
                          limits:
                            nvidia.com/gpu: "8"
                            memory: 1124Gi
                            ephemeral-storage: 800Gi
                          requests:
                            ephemeral-storage: 800Gi
                            cpu: 125
                        ports:
                          - containerPort: 8080
                        readinessProbe:
                          tcpSocket:
                            port: 8080
                          initialDelaySeconds: 15
                          periodSeconds: 10
                        volumeMounts:
                          - mountPath: /dev/shm
                            name: dshm
                    volumes:
                      - name: dshm
                        emptyDir:
                          medium: Memory
                          sizeLimit: 15Gi
                workerReplicas: 1
                workerTemplate:
                  spec:
                    containers:
                      - name: worker
                        image: vllm/vllm-openai:latest
                        command:
                          - sh
                          - -c
                          - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh worker --ray_address=$(ENTRY_ADDRESS)"
                        resources:
                          limits:
                            nvidia.com/gpu: "8"
                            memory: 1124Gi
                            ephemeral-storage: 800Gi
                          requests:
                            ephemeral-storage: 800Gi
                            cpu: 125
                        env:
                          - name: HUGGING_FACE_HUB_TOKEN
                            valueFrom:
                              secretKeyRef:
                                name: hf-token
                                key: HUGGING_FACE_HUB_TOKEN
                        volumeMounts:
                          - mountPath: /dev/shm
                            name: dshm
                    volumes:
                      - name: dshm
                        emptyDir:
                          medium: Memory
                          sizeLimit: 15Gi
        ```
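Apply the manifest:

```bash
kubectl apply -f modelserving.yaml
```

Kthena will:

- Create the `ModelServing` object.
- Create a `PodGroup` for Volcano gang scheduling.
- Create the pods for each `ServingGroup` and `Role`.

To check the `ModelServing` status, use the snippet from the Kthena docs: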
```bash
kubectl get modelserving -oyaml | grep status -A 10
```

You should see something like:

```yaml
status:
  availableReplicas: 1
  conditions:
    - type: Available
      status: "True"
      reason: AllGroupsReady
      message: All Serving groups are ready
    - type: Progressing
      status: "False"
      ...
  replicas: 1
  updatedReplicas: 1
```
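In automation, the `grep` can be replaced by a polling loop. The following is a sketch rather than Kthena tooling: `wait_for_available` is a hypothetical helper, and it assumes the `Available` condition shown in the sample status is the readiness signal you care about. The JSONPath filter is standard `kubectl` syntax.

```shell
# Poll until the ModelServing reports Available=True, or give up.
# Usage: wait_for_available <name> <namespace> <attempts> <sleep-seconds>
wait_for_available() (
  name=$1; ns=$2; attempts=$3; delay=$4
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Extract only the status of the condition whose type is "Available".
    status=$(kubectl get modelserving "$name" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) || true
    if [ "$status" = "True" ]; then
      echo "ModelServing $name is available"
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Timed out waiting for ModelServing $name" >&2
  exit 1
)

# Usage against a live cluster (not run here):
#   wait_for_available llama-multinode default 60 10
```

Large models can take a long time to download and load, so choose the attempt count and delay generously.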
List pods for your deployment:
```bash
kubectl get pod -owide -l modelserving.volcano.sh/name=llama-multinode
```

Example output (from docs):

```text
NAMESPACE   NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE           ...
default     llama-multinode-0-405b-0-0    1/1     Running   0          15m   10.244.0.56   192.168.5.12   ...
default     llama-multinode-0-405b-0-1    1/1     Running   0          15m   10.244.0.58   192.168.5.43   ...
default     llama-multinode-0-405b-1-0    1/1     Running   0          15m   10.244.0.57   192.168.5.58   ...
default     llama-multinode-0-405b-1-1    1/1     Running   0          15m   10.244.0.53   192.168.5.36   ...
```
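The four pods are consistent with the manifest values if each role replica consists of one entry pod plus `workerReplicas` worker pods. That reading is an assumption drawn from this example rather than a documented formula, and `expected_pods` is a hypothetical helper.

```shell
# Usage: expected_pods <group-replicas> <role-replicas> <worker-replicas>
expected_pods() {
  # Each role replica contributes 1 entry pod plus its worker pods.
  echo $(( $1 * $2 * (1 + $3) ))
}

# spec.replicas=1, role replicas=2, workerReplicas=1, as in the example.
expected_pods 1 2 1   # prints 4
```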
Pod name pattern:

- `llama-multinode-<group-idx>-<role-name>-<replica-idx>-<ordinal>`.
- The first number indicates the `ServingGroup`. The second (`405b`) is the `Role`. The remaining indices identify the pod within the role.
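When filtering logs or events by pod, it can help to split a name back into these fields. The helper below is a hypothetical sketch that assumes names follow the pattern exactly; it takes the two trailing indices from the right, so a role name that itself contains hyphens is still handled.

```shell
# Usage: parse_pod_name <modelserving-name> <pod-name>
parse_pod_name() (
  ms_name=$1; pod=$2
  rest=${pod#"$ms_name"-}   # strip the ModelServing name prefix
  group=${rest%%-*}         # first field: ServingGroup index
  rest=${rest#*-}
  ordinal=${rest##*-}       # last field: pod ordinal
  rest=${rest%-*}
  replica=${rest##*-}       # second to last: role replica index
  role=${rest%-*}           # whatever remains is the role name
  echo "group=$group role=$role replica=$replica ordinal=$ordinal"
)

parse_pod_name llama-multinode llama-multinode-0-405b-1-0
# prints "group=0 role=405b replica=1 ordinal=0"
```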
Expose the entry via a Service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: llama-multinode-openai
  namespace: default
spec:
  selector:
    modelserving.volcano.sh/name: llama-multinode
    modelserving.volcano.sh/entry: "true"
    # optionally further narrow to leader role if you label it
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
```
Port-forward from your local machine:
```bash
kubectl port-forward svc/llama-multinode-openai 30080:80 -n default
```

Then:

List models:

```bash
curl -s http://localhost:30080/v1/models
```

Send a completion request (mirroring vLLM production stack docs):

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```
You should see an OpenAI-style response from vLLM.
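For a scripted smoke test, checking the HTTP status code is often enough. The sketch below is illustrative: `models_endpoint_ok` is a hypothetical helper, and it assumes the port-forward on `localhost:30080` is still running. It uses only standard `curl` options (`-s`, `-o`, `-w`).

```shell
# Succeeds when GET <base-url>/v1/models answers with HTTP 200.
# Usage: models_endpoint_ok <base-url>
models_endpoint_ok() (
  base_url=$1
  # -w '%{http_code}' prints only the status code; the body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' "$base_url/v1/models") || true
  [ "$code" = "200" ]
)

# Usage with the port-forward active (not run here):
#   models_endpoint_ok http://localhost:30080 && echo "endpoint ready"
```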
To remove the deployment and its resources:
```bash
kubectl delete modelserving llama-multinode -n default
```

If you’re done with the entire stack:

```bash
helm uninstall kthena -n kthena-system # or your Kthena release name
helm uninstall volcano -n volcano-system
```