This task describes how to configure Istio to use the Kubernetes Gateway API Inference Extension. The Gateway API Inference Extension aims to improve and standardize routing to self-hosted AI models in Kubernetes. It utilizes CRDs from the Kubernetes Gateway API and leverages Envoy's External Processing filter to extend any Gateway into an inference gateway.
The Gateway API Inference Extension introduces two API kinds to address the unique traffic-routing challenges of inference workloads:

* **InferencePool** represents a collection of backends for an inference workload, and contains a reference to an associated endpoint picker service. The Envoy `ext_proc` filter routes incoming requests through the endpoint picker service, which makes an informed routing decision and selects an optimal backend in the inference pool.
* **InferenceObjective** allows specifying the serving objectives of the requests associated with it; a minimal example is sketched below.
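As a quick illustration of the second kind, a minimal InferenceObjective might look like the following sketch. The API version and the `priority` and `poolRef` field names are assumptions based on the extension's alpha API and may differ between releases:

{{< text yaml >}}
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: example-objective
  namespace: inference-model-server
spec:
  # Requests associated with a higher-priority objective are favored by the
  # endpoint picker when capacity is constrained (field names are assumptions).
  priority: 10
  poolRef:
    group: inference.networking.k8s.io
    name: inference-model-server-pool
{{< /text >}}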
Because the Gateway API is a prerequisite for the Inference Extension APIs, install both the Gateway API and Gateway API Inference Extension CRDs if they are not already present:
{{< text bash >}}
$ kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
  { kubectl kustomize "github.com/kubernetes-sigs/gateway-api/config/crd?ref={{< k8s_gateway_api_version >}}" | kubectl apply -f -; }
$ kubectl get crd inferencepools.inference.networking.k8s.io &> /dev/null || \
  { kubectl kustomize "github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref={{< k8s_gateway_api_inference_extension_version >}}" | kubectl apply -f -; }
{{< /text >}}
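Optionally, wait for both sets of CRDs to be accepted by the API server before continuing:

{{< text bash >}}
$ kubectl wait --for=condition=established crd/gateways.gateway.networking.k8s.io crd/inferencepools.inference.networking.k8s.io
{{< /text >}}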
Install Istio using the minimal profile, enabling support for the inference extension:

{{< text bash >}}
$ istioctl install --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true -y
{{< /text >}}
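To confirm the feature flag reached the control plane, you can inspect the istiod deployment's environment; this simply greps for the variable set above:

{{< text bash >}}
$ kubectl -n istio-system get deployment istiod -o yaml | grep -A1 ENABLE_GATEWAY_API_INFERENCE_EXTENSION
{{< /text >}}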
For a detailed guide on setting up a local test environment, see the Gateway API Inference Extension documentation.
In this example, we will deploy an inference model service using a vLLM simulator, and use an InferencePool and the endpoint picker to route requests to individual backends.
Deploy a basic vLLM simulator to act as our inference workload, along with the essential Gateway API resources, by applying the following manifests.
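The vLLM simulator Deployment is summarized here as a minimal sketch: the container image, tag, and simulator arguments are assumptions and should be replaced with the simulator manifest you actually use. The pod labels must match the selector on the InferencePool created later, and the model name matches the `reviews-1` model queried below:

{{< text yaml >}}
apiVersion: v1
kind: Namespace
metadata:
  name: inference-model-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-model-server-deployment
  namespace: inference-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-model-server
  template:
    metadata:
      labels:
        app: inference-model-server
    spec:
      containers:
      - name: vllm-sim
        # Image and arguments are assumptions for this sketch; substitute the
        # vLLM simulator image published by the project you are following.
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
        args:
        - --model=reviews-1
        - --port=8000
        ports:
        - containerPort: 8000
{{< /text >}}

The Gateway and HTTPRoute below then expose the workload through Istio: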
{{< text yaml >}}
apiVersion: v1
kind: Namespace
metadata:
  name: istio-ingress
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
  namespace: istio-ingress
spec:
  gatewayClassName: istio
  listeners:
  # The listener details are an assumption: a single plain-HTTP listener on
  # port 80 that accepts routes from any namespace.
  - name: default
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
{{< /text >}}
{{< text yaml >}}
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: httproute-for-inferencepool
  namespace: inference-model-server
spec:
  parentRefs:
  - name: gateway
    namespace: istio-ingress
  rules:
  - backendRefs:
    # The route forwards to the InferencePool created below, rather than to a
    # regular Kubernetes Service.
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: inference-model-server-pool
{{< /text >}}
Deploy the endpoint picker service and create an InferencePool:
{{< text yaml >}}
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: inference-model-reader
  namespace: inference-model-server
rules:
# The rule list is an assumed minimal set: the endpoint picker reads pods and
# inference resources in order to select backends.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["inference.networking.k8s.io"]
  resources: ["inferencepools"]
  verbs: ["get", "list", "watch"]
{{< /text >}}
{{< text yaml >}}
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: epp-to-inference-model-reader
  namespace: inference-model-server
subjects:
# The subject is assumed to be the endpoint picker's service account.
- kind: ServiceAccount
  name: inference-endpoint-picker
  namespace: inference-model-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: inference-model-reader
{{< /text >}}
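The endpoint picker Deployment and Service themselves (named `inference-endpoint-picker` in this example) are not reproduced here; they can be taken from the Gateway API Inference Extension project's release manifests. The InferencePool that ties the pieces together might look like the sketch below; the `targetPorts` and `endpointPickerRef` field names are assumptions against the extension's v1 API:

{{< text yaml >}}
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: inference-model-server-pool
  namespace: inference-model-server
spec:
  # Select the vLLM simulator pods deployed earlier.
  selector:
    matchLabels:
      app: inference-model-server
  # The port the model server containers listen on.
  targetPorts:
  - number: 8000
  # The endpoint picker service consulted (via ext_proc) for each request.
  endpointPickerRef:
    name: inference-endpoint-picker
    port:
      number: 9002
{{< /text >}}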
Set the Ingress Host environment variable:
{{< text bash >}}
$ kubectl wait -n istio-ingress --for=condition=programmed gateways.gateway.networking.k8s.io gateway
$ export INGRESS_HOST=$(kubectl get gateways.gateway.networking.k8s.io gateway -n istio-ingress -ojsonpath='{.status.addresses[0].value}')
{{< /text >}}
Send an inference request using curl; you should see a successful response from the backend model server:
{{< text bash >}}
$ curl -s -i "http://$INGRESS_HOST/v1/completions" -d '{"model": "reviews-1", "prompt": "What do reviewers think about The Comedy of Errors?", "max_tokens": 100, "temperature": 0}'
HTTP/1.1 200 OK
...
server: istio-envoy
...
{"choices":[{"finish_reason":"stop","index":0,"text":"Testing@, #testing 1$ ,2%,3^, [4"}],"created":1770406965,"id":"cmpl-5e508481-7c11-53e8-9587-972a3704724e","kv_transfer_params":null,"model":"reviews-1","object":"text_completion","usage":{"completion_tokens":16,"prompt_tokens":10,"total_tokens":26}}
{{< /text >}}
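To watch the endpoint picker spread load across the pool, you can send several requests in a row; this loop is a plain bash sketch:

{{< text bash >}}
$ for i in $(seq 1 5); do
    curl -s "http://$INGRESS_HOST/v1/completions" \
      -d '{"model": "reviews-1", "prompt": "Hello", "max_tokens": 10, "temperature": 0}' > /dev/null \
      && echo "request $i: OK"
  done
{{< /text >}}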
Remove deployments and Gateway API resources:
{{< text bash >}}
$ kubectl delete deployment inference-model-server-deployment inference-endpoint-picker -n inference-model-server
$ kubectl delete httproute httproute-for-inferencepool -n inference-model-server
$ kubectl delete inferencepool inference-model-server-pool -n inference-model-server
$ kubectl delete gateways.gateway.networking.k8s.io gateway -n istio-ingress
$ kubectl delete ns istio-ingress inference-model-server
{{< /text >}}
Uninstall Istio:
{{< text bash >}}
$ istioctl uninstall -y --purge
$ kubectl delete ns istio-system
{{< /text >}}
Remove the Gateway API and Gateway API Inference Extension CRDs if they are no longer needed:
{{< text bash >}}
$ kubectl kustomize "github.com/kubernetes-sigs/gateway-api/config/crd?ref={{< k8s_gateway_api_version >}}" | kubectl delete -f -
$ kubectl kustomize "github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref={{< k8s_gateway_api_inference_extension_version >}}" | kubectl delete -f -
{{< /text >}}