This repository contains the Google Kubernetes Engine (GKE) Marketplace application deployer for the NVIDIA Triton Inference Server.
The gcloud SDK CLI can be run on a client machine signed in with your GCP credentials. First, install this Triton GKE app into an existing GKE cluster with a GPU node pool; Google Cloud Marketplace currently doesn't support automatic creation of GPU clusters. Run the following commands to create a compatible cluster (GKE version >= 1.18.7) with GPU node pools. We recommend selecting T4 or A100 (MIG) instance types and choosing the CPU ratio based on profiling of your actual inference workload.
Users need to follow these instructions to create a Kubernetes service account. In this example, we use [email protected]. Make sure it has access to Artifact Registry and the Monitoring Viewer role. For example, to grant access to the custom metrics required for HPA to work:
```shell
gcloud iam service-accounts add-iam-policy-binding --role \
  roles/iam.workloadIdentityUser --member \
  "serviceAccount:<project-id>.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  <google-service-account>@<project-id>.iam.gserviceaccount.com

kubectl annotate serviceaccount --namespace custom-metrics \
  custom-metrics-stackdriver-adapter \
  iam.gke.io/gcp-service-account=<google-service-account>@<project-id>.iam.gserviceaccount.com
```
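To confirm the binding and annotation took effect, you can inspect them afterwards (a sketch; substitute your own project and service-account placeholders, and note this requires active GCP credentials):

```shell
# Inspect the IAM policy on the Google service account to verify the
# workloadIdentityUser binding is present
gcloud iam service-accounts get-iam-policy \
  <google-service-account>@<project-id>.iam.gserviceaccount.com

# Confirm the annotation was applied to the Kubernetes service account
kubectl get serviceaccount custom-metrics-stackdriver-adapter \
  --namespace custom-metrics -o yaml
```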
Currently, GKE >= 1.18.7 is only available in the GKE rapid channel; to find the latest version, please visit the GKE release notes.
```shell
export PROJECT_ID=<your GCP project ID>
export ZONE=<GCP zone of your choice>
export REGION=<GCP region of your choice>
export DEPLOYMENT_NAME=<GKE cluster name, triton-gke for example>
# example: export SERVICE_ACCOUNT="[email protected]"
export SERVICE_ACCOUNT=<Your GKE service account>

gcloud beta container clusters create ${DEPLOYMENT_NAME} \
  --addons=HorizontalPodAutoscaling,HttpLoadBalancing \
  --service-account=${SERVICE_ACCOUNT} \
  --machine-type=n1-standard-8 \
  --node-locations=${ZONE} \
  --monitoring=SYSTEM \
  --zone=${ZONE} \
  --subnetwork=default \
  --scopes cloud-platform \
  --num-nodes 1 \
  --project ${PROJECT_ID}

# Add a GPU node pool; adjust the number of nodes based on your workload
gcloud container node-pools create accel \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --cluster ${DEPLOYMENT_NAME} \
  --service-account=${SERVICE_ACCOUNT} \
  --num-nodes 2 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --enable-autoscaling --min-nodes 2 --max-nodes 3 \
  --machine-type n1-standard-4 \
  --disk-size=100 \
  --scopes cloud-platform \
  --verbosity error

# So that you can run kubectl locally against the cluster
gcloud container clusters get-credentials ${DEPLOYMENT_NAME} --project ${PROJECT_ID} --zone ${ZONE}

# Deploy the NVIDIA device plugin for GKE to prepare GPU nodes for driver install
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# Make sure you can run kubectl locally to access the cluster
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user "$(gcloud config get-value account)"

# Enable the Stackdriver custom metrics adapter
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

# Create an IP for ingress traffic
gcloud compute addresses create ingress-triton --global
```
Creating a cluster and adding GPU nodes can take up to 10 minutes, so please be patient after executing these commands. GPU resources in a GCP zone may be fully utilized, so try a different zone if compute resources cannot be allocated. Once the GKE cluster is running, run `kubectl get pods --all-namespaces` to make sure the client can access the cluster correctly.
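Once the driver installer daemonset has finished, you can also verify that the GPU nodes are schedulable (a sketch; the node label is the one GKE applies to accelerator node pools):

```shell
# List nodes in the GPU node pool
kubectl get nodes -l cloud.google.com/gke-accelerator

# Check that each GPU node reports allocatable nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"
```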
If you would like to experiment with A100 MIG-partitioned GPUs in GKE, create the node pool with the following command:
```shell
gcloud beta container node-pools create accel \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --cluster ${DEPLOYMENT_NAME} \
  --service-account=${SERVICE_ACCOUNT} \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb \
  --enable-autoscaling --min-nodes 1 --max-nodes 2 \
  --machine-type=a2-highgpu-1g \
  --disk-size=100 \
  --scopes cloud-platform \
  --verbosity error
```
Please note that A100 MIG in GKE does not yet support GPU metrics, and Triton GPU metrics are not compatible with A100 MIG either. Hence, please disable GPU metrics by unselecting `allowGPUMetrics` when deploying the Triton GKE app. For the same reason, this deployer does not support inference workload auto-scaling on A100 MIG.
Second, go to this GKE Marketplace link to deploy the Triton application.
Users can leave everything as default if their models have already been tested/validated with Triton; they only need to provide a GCS path pointing to the model repository containing their models. By default, we provide a BERT large model optimized by TensorRT in a public demo GCS bucket, gs://triton_sample_models/xx_yy, compatible with the xx.yy release of Triton Server. However, please note that this demo bucket is located in us-central1, so loading from it into Triton in other regions may be affected by cross-region latency. Here <xx.yy> is the version of the NGC Triton container needed.
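If you prefer to serve from a bucket in your own region, one approach is to copy the demo repository into a bucket you own; the destination bucket name below is a placeholder, and `xx_yy` stays as the Triton version directory described above:

```shell
# Copy the demo model repository into your own regional bucket
# (-m parallelizes the copy; replace <your-bucket> with a bucket you own)
gsutil -m cp -r gs://triton_sample_models/xx_yy gs://<your-bucket>/models
```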
We want to discuss the HPA autoscaling metrics users can leverage. GPU power (percentage of maximum power draw) tends to be a reliable metric, especially for larger GPUs like V100 and A100. GKE currently natively supports GPU duty cycle, which corresponds to GPU utilization in nvidia-smi. We ask that users always profile their model to determine the autoscaling target and metric. When selecting metrics for autoscaling, the goal should be to: 1) meet the SLA requirement, 2) account for transient request load, and 3) keep the GPU as fully utilized as possible. Profiling helps in two ways. If you decide to use duty cycle or another GPU metric, it is recommended to establish a baseline linking the SLA requirement, such as latency, to the GPU metric; for example, for model A, latency stays below 10 ms 99% of the time when duty cycle is below 80%. Additionally, profiling provides insight into model optimization for inference, with tools like Nsight.
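As an illustration of scaling on a GPU metric, the sketch below defines an HPA against the duty-cycle metric exported through the Stackdriver adapter installed earlier. The deployment name, replica bounds, and the 80% target are assumptions to be replaced after profiling, and the external metric name depends on the adapter's resource model and your GKE version:

```shell
# Hypothetical HPA targeting GPU duty cycle via the Stackdriver
# custom metrics adapter; adjust names and targets for your deployment.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton               # assumed deployment name
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: External
    external:
      metric:
        name: kubernetes.io|container|accelerator|duty_cycle
      target:
        type: AverageValue
        averageValue: "80"     # scale out above ~80% duty cycle
EOF
```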
Once the application is deployed successfully, get the public IP from the ingress:
```shell
> kubectl get ingress
NAME              CLASS    HOSTS   ADDRESS          PORTS   AGE
triton-external   <none>   *       35.186.215.182   80      107s
```
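The commands below stash the ingress address in the environment variables used by the client examples; the ingress name `triton-external` matches the listing above, and port 80 follows from the HTTP ingress:

```shell
# Capture the ingress IP and port for the client examples
export INGRESS_HOST=$(kubectl get ingress triton-external \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=80
echo "Triton endpoint: http://${INGRESS_HOST}:${INGRESS_PORT}"
```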
Third, we will try sending requests to the server with the provided client examples.
If you selected HTTP when deploying Triton, launch Locust with the ingress host and port to query the Triton Inference Server. In this example script, we send requests to a Triton server that has loaded a BERT large TensorRT engine with a sequence length of 128 from a GCP bucket. We simulate 1000 concurrent users as the target and spawn users at a rate of 50 per second.
```shell
locust -f locustfile_bert.py -H http://${INGRESS_HOST}:${INGRESS_PORT}
```
The client example pushes about ~650 QPS (queries per second) to the Triton server and will trigger auto-scaling of T4 GPU nodes (we recommend T4 and A100 [MIG] for inference). From the Locust UI, we will observe a drop in the mean and variance of request latency. After autoscaling completes, the latency stabilizes at ~200 ms end to end from a US client to a Europe server, which is excellent for a model with 345 million parameters. Since each node uses one T4 with an n1-standard-4 instance and can handle ~450 QPS, at the on-demand price of ($0.35 + $0.19) = $0.54/hr, that translates to 3 million inferences per dollar for BERT large at batch size 1. Furthermore, with the 3-year commitment price, the hourly rate is ($0.16 + $0.08) = $0.24/hr, which translates to 6.75 million inferences per dollar.
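The cost figures above can be reproduced with a little shell arithmetic; the QPS and hourly prices are taken from the paragraph above:

```shell
# Inferences per dollar = (QPS * 3600 s/hr) / (price per hour)
qps=450
on_demand=0.54     # $/hr for 1x T4 + n1-standard-4, on-demand
committed=0.24     # $/hr with a 3-year commitment
awk -v q="$qps" -v p="$on_demand" \
  'BEGIN { printf "on-demand:  %.2f M inferences/$\n", q*3600/p/1e6 }'
awk -v q="$qps" -v p="$committed" \
  'BEGIN { printf "3-yr commit: %.2f M inferences/$\n", q*3600/p/1e6 }'
```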
Alternatively, users can opt to use Perf Analyzer to profile and study the performance of Triton Inference Server. Here we also provide a client script that uses Perf Analyzer to send gRPC requests to the Triton Server GKE deployment. The Perf Analyzer client requires the NGC Triton client container.
```shell
bash perf_analyzer_grpc.sh ${INGRESS_HOST}:${INGRESS_PORT}
```
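For reference, inside the NGC Triton client container a direct Perf Analyzer invocation might look like the sketch below; the model name `bert` is an assumption, and the flags should be adjusted to match your deployment:

```shell
# Hypothetical direct invocation; run inside the Triton client container.
# -m: model name (assumed), -i: protocol, -u: server endpoint,
# --concurrency-range: sweep client concurrency from 1 to 4
perf_analyzer -m bert -i grpc -u ${INGRESS_HOST}:${INGRESS_PORT} \
  --concurrency-range 1:4
```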
See the following resources to learn more about NVIDIA Triton Inference Server and GKE GPU capabilities.