
# Helm

A Helm chart to deploy vLLM for Kubernetes

Helm is a package manager for Kubernetes. It helps automate the deployment of vLLM applications on Kubernetes. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

This guide walks you through deploying vLLM with Helm, covering the necessary prerequisites, the installation steps, and documentation of the chart's architecture and values file.

## Prerequisites

Before you begin, ensure that you have the following:

  • A running Kubernetes cluster
  • NVIDIA Kubernetes Device Plugin (k8s-device-plugin): This can be found at https://github.com/NVIDIA/k8s-device-plugin
  • Available GPU resources in your cluster
  • (Optional) An S3 bucket or other storage with the model weights, if using automatic model download
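Before installing the chart, it can help to confirm that the device plugin is running and that GPUs are actually advertised to the scheduler. A quick check might look like the following (the namespace and labels are assumptions based on the plugin's static manifest and vary by install method):

```bash
# Verify the NVIDIA device plugin pods are running
# (namespace/labels depend on how the plugin was installed)
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

# Verify that nodes advertise GPU capacity to the scheduler
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"
```

If no `nvidia.com/gpu` capacity appears, the chart's pods will stay unschedulable because the default resources request one GPU.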

## Installing the chart

This guide uses the Helm chart at `examples/online_serving/chart-helm` in the vLLM repository.

To install the chart with the release name `test-vllm`, run the following from the chart directory:

```bash
helm upgrade --install --create-namespace \
  --namespace=ns-vllm test-vllm . \
  -f values.yaml \
  --set secrets.s3endpoint=$ACCESS_POINT \
  --set secrets.s3bucketname=$BUCKET \
  --set secrets.s3accesskeyid=$ACCESS_KEY \
  --set secrets.s3accesskey=$SECRET_KEY
```
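Once the release is installed, a quick way to check that the server is up is to port-forward the service and hit the OpenAI-compatible endpoint. The commands below are a sketch: the service name is generated by the chart, so substitute the one reported by `kubectl get services`, and `80` assumes the default `servicePort` from the values table.

```bash
kubectl get pods -n ns-vllm       # pods should reach Running/Ready
kubectl get services -n ns-vllm   # note the service name for port-forwarding

# Forward local port 8080 to the service and list the served models
kubectl port-forward -n ns-vllm service/<service-name> 8080:80
curl http://localhost:8080/v1/models
```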

## Uninstalling the chart

To uninstall the `test-vllm` release:

```bash
helm uninstall test-vllm --namespace=ns-vllm
```

This command removes all Kubernetes components associated with the chart, including persistent volumes, and deletes the release.

## Architecture

## Values

The following table describes the configurable parameters of the chart, as defined in `values.yaml`:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| autoscaling | object | `{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}` | Autoscaling configuration |
| autoscaling.enabled | bool | `false` | Enable autoscaling |
| autoscaling.maxReplicas | int | `100` | Maximum replicas |
| autoscaling.minReplicas | int | `1` | Minimum replicas |
| autoscaling.targetCPUUtilizationPercentage | int | `80` | Target CPU utilization for autoscaling |
| configs | object | `{}` | ConfigMap |
| containerPort | int | `8000` | Container port |
| customObjects | list | `[]` | Custom objects configuration |
| deploymentStrategy | object | `{}` | Deployment strategy configuration |
| externalConfigs | list | `[]` | External configuration |
| extraContainers | list | `[]` | Additional containers configuration |
| extraInit | object | `{"modelDownload":{"enabled":true},"initContainers":[],"pvcStorage":"1Gi"}` | Additional configuration for init containers |
| extraInit.modelDownload | object | `{"enabled":true}` | Model download functionality configuration |
| extraInit.modelDownload.enabled | bool | `true` | Enable automatic model download job and wait container |
| extraInit.modelDownload.image | object | `{"repository":"amazon/aws-cli","tag":"2.6.4","pullPolicy":"IfNotPresent"}` | Image for model download operations |
| extraInit.modelDownload.waitContainer | object | `{}` | Wait container configuration (command, args, env) |
| extraInit.modelDownload.downloadJob | object | `{}` | Download job configuration (command, args, env) |
| extraInit.initContainers | list | `[]` | Custom init containers (appended after model download if enabled) |
| extraInit.pvcStorage | string | `"1Gi"` | Storage size for the PVC |
| extraInit.s3modelpath | string | `"relative_s3_model_path/opt-125m"` | (Optional) Path of the model on S3 |
| extraInit.awsEc2MetadataDisabled | bool | `true` | (Optional) Disable the AWS EC2 metadata service |
| extraPorts | list | `[]` | Additional ports configuration |
| gpuModels | list | `["TYPE_GPU_USED"]` | Type of GPU used |
| image | object | `{"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}` | Image configuration |
| image.command | list | `["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]` | Container launch command |
| image.repository | string | `"vllm/vllm-openai"` | Image repository |
| image.tag | string | `"latest"` | Image tag |
| livenessProbe | object | `{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}` | Liveness probe configuration |
| livenessProbe.failureThreshold | int | `3` | Number of consecutive probe failures after which Kubernetes considers the container not alive |
| livenessProbe.httpGet | object | `{"path":"/health","port":8000}` | Configuration of the kubelet HTTP request to the server |
| livenessProbe.httpGet.path | string | `"/health"` | Path to access on the HTTP server |
| livenessProbe.httpGet.port | int | `8000` | Name or number of the container port the server listens on |
| livenessProbe.initialDelaySeconds | int | `15` | Number of seconds after container start before the liveness probe is initiated |
| livenessProbe.periodSeconds | int | `10` | How often (in seconds) to perform the liveness probe |
| maxUnavailablePodDisruptionBudget | string | `""` | Disruption budget configuration |
| readinessProbe | object | `{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}` | Readiness probe configuration |
| readinessProbe.failureThreshold | int | `3` | Number of consecutive probe failures after which Kubernetes considers the container not ready |
| readinessProbe.httpGet | object | `{"path":"/health","port":8000}` | Configuration of the kubelet HTTP request to the server |
| readinessProbe.httpGet.path | string | `"/health"` | Path to access on the HTTP server |
| readinessProbe.httpGet.port | int | `8000` | Name or number of the container port the server listens on |
| readinessProbe.initialDelaySeconds | int | `5` | Number of seconds after container start before the readiness probe is initiated |
| readinessProbe.periodSeconds | int | `5` | How often (in seconds) to perform the readiness probe |
| replicaCount | int | `1` | Number of replicas |
| resources | object | `{"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}` | Resource configuration |
| resources.limits."nvidia.com/gpu" | int | `1` | Number of GPUs |
| resources.limits.cpu | int | `4` | Number of CPUs |
| resources.limits.memory | string | `"16Gi"` | CPU memory limit |
| resources.requests."nvidia.com/gpu" | int | `1` | Number of GPUs |
| resources.requests.cpu | int | `4` | Number of CPUs |
| resources.requests.memory | string | `"16Gi"` | CPU memory request |
| secrets | object | `{}` | Secrets configuration |
| serviceName | string | `""` | Service name |
| servicePort | int | `80` | Service port |
| labels.environment | string | `test` | Environment name |
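For example, to enable the autoscaling parameters from the table above, an overrides file could look like this (the replica counts and target utilization are illustrative, not recommendations):

```yaml
# custom-values.yaml -- illustrative override file
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
```

Passing this file with an additional `-f custom-values.yaml` overrides only the keys it sets; all other parameters keep their chart defaults.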

## Configuration Examples

### Using S3 Model Download (Default)

```yaml
extraInit:
  modelDownload:
    enabled: true
  pvcStorage: "10Gi"
  s3modelpath: "models/llama-7b"
```

### Using Custom Init Containers Only

For use cases like llm-d where you need custom sidecars without model download:

```yaml
extraInit:
  modelDownload:
    enabled: false
  initContainers:
    - name: llm-d-routing-proxy
      image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8080
          name: proxy
      securityContext:
        runAsUser: 1000
      restartPolicy: Always
  pvcStorage: "10Gi"
```
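An overrides file like the examples above can be applied at install or upgrade time alongside the chart's default values; later `-f` files take precedence over earlier ones. The release and namespace names below are simply the ones used in this guide, and `custom-values.yaml` is a placeholder for your own overrides file:

```bash
helm upgrade --install --create-namespace \
  --namespace=ns-vllm test-vllm . \
  -f values.yaml \
  -f custom-values.yaml
```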