SWIP-10 Support Envoy AI Gateway Observability

docs/en/swip/SWIP-10/SWIP.md

Motivation

Envoy AI Gateway is a gateway/proxy for AI/LLM API traffic (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, etc.) built on top of Envoy Proxy. It provides GenAI-specific observability following OpenTelemetry GenAI Semantic Conventions, including token usage tracking, request latency, time-to-first-token (TTFT), and inter-token latency.

SkyWalking should support monitoring Envoy AI Gateway as a first-class integration, providing:

  1. Metrics monitoring via OTLP push for GenAI metrics.
  2. Access log collection via OTLP log sink for per-request AI metadata analysis.

This is complementary to PR #13745 (agent-based Virtual GenAI monitoring). The agent-based approach monitors LLM calls from the client application side, while this SWIP monitors from the gateway (infrastructure) side. Both can coexist — the AI Gateway provides infrastructure-level visibility regardless of whether the calling application is instrumented.

Architecture Graph

Metrics Path (OTLP Push)

```
┌─────────────────┐       OTLP gRPC       ┌─────────────────┐
│  Envoy AI       │  ──────────────────>  │  SkyWalking OAP │
│  Gateway        │  (push, port 11800)   │  (otel-receiver)│
│                 │                       │                 │
│  4 GenAI metrics│                       │  MAL rules      │
│  + labels       │                       │  → aggregation  │
└─────────────────┘                       └─────────────────┘
```

Access Log Path (OTLP Push)

```
┌─────────────────┐       OTLP gRPC       ┌─────────────────┐
│  Envoy AI       │  ──────────────────>  │  SkyWalking OAP │
│  Gateway        │  (push, port 11800)   │  (otel-receiver)│
│                 │                       │                 │
│  access logs    │                       │  LAL rules      │
│  with AI meta   │                       │  → analysis     │
└─────────────────┘                       └─────────────────┘
```

The AI Gateway natively supports an OTLP access log sink (via Envoy Gateway's OpenTelemetry sink), pushing structured access logs directly to the OAP's OTLP receiver. No FluentBit or external log collector is needed.

Proposed Changes

1. New Layer: ENVOY_AI_GATEWAY

Add a new layer in Layer.java:

```java
/**
 * Envoy AI Gateway is an AI/LLM traffic gateway built on Envoy Proxy,
 * providing observability for GenAI API traffic.
 */
ENVOY_AI_GATEWAY(46, true),
```

This is a normal layer (isNormal=true) because the AI Gateway is a real, instrumented infrastructure component (similar to KONG, APISIX, NGINX), not a virtual/conjectured service.

2. Entity Model

job_name — Routing Tag for MAL/LAL Rules

The job_name resource attribute is set explicitly in OTEL_RESOURCE_ATTRIBUTES to a fixed value for all AI Gateway deployments. MAL rule filters use it to route metrics to the correct rule set:

```yaml
filter: "{ tags -> tags.job_name == 'envoy-ai-gateway' }"
```

job_name is NOT the SkyWalking service name — it is only used for metric/log routing. The SkyWalking service name comes from OTEL_SERVICE_NAME (standard OTel env var), which is set per deployment.

Service and Instance Mapping

| SkyWalking Entity | Source | Example |
|---|---|---|
| Service | OTEL_SERVICE_NAME / service.name (per-deployment gateway name) | my-ai-gateway |
| Service Instance | service.instance.id resource attribute (pod name, set via Downward API) | aigw-pod-7b9f4d8c5 |

Each Kubernetes Gateway deployment sets its own OTEL_SERVICE_NAME (the standard OTel env var) as the SkyWalking service name. Each pod is a service instance identified by service.instance.id.

The job_name resource attribute is set explicitly to the fixed value envoy-ai-gateway for MAL/LAL rule routing. This is separate from service.name — all AI Gateway deployments share the same job_name for routing, but each has its own service.name for entity identity.

The layer (ENVOY_AI_GATEWAY) is set via service.layer resource attribute and used by LAL for log routing. MAL rules use job_name for metric routing.

Provider and model are metric-level labels, not separate entities in this layer. They are used for fine-grained metric breakdowns within the gateway service dashboards rather than being modeled as separate services (unlike the agent-based VIRTUAL_GENAI layer where provider=service, model=instance).

The MAL expSuffix uses the service_name tag as the SkyWalking service name and service_instance_id as the instance name:

```yaml
expSuffix: service(['service_name'], Layer.ENVOY_AI_GATEWAY).instance(['service_name', 'service_instance_id'])
```
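
To make the attribute-to-entity mapping concrete, here is a small Python sketch of how the resource attributes described above resolve to service, instance, layer, and routing tag. The helper names are illustrative, not SkyWalking code:

```python
# Illustrative sketch (not OAP code): resolve SkyWalking entity identity
# from OTEL_SERVICE_NAME plus an OTEL_RESOURCE_ATTRIBUTES string.

def parse_resource_attributes(raw: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style 'k=v,k=v' string."""
    return dict(pair.split("=", 1) for pair in raw.split(",") if pair)

def to_skywalking_entity(service_name: str, resource_attrs: dict) -> dict:
    """Service comes from OTEL_SERVICE_NAME, instance from
    service.instance.id, layer from service.layer. job_name is only a
    MAL/LAL routing tag, never part of entity identity."""
    return {
        "service": service_name,
        "instance": resource_attrs.get("service.instance.id"),
        "layer": resource_attrs.get("service.layer"),
        "routing_tag": resource_attrs.get("job_name"),
    }

attrs = parse_resource_attributes(
    "job_name=envoy-ai-gateway,service.instance.id=aigw-pod-7b9f4d8c5,"
    "service.layer=ENVOY_AI_GATEWAY"
)
entity = to_skywalking_entity("my-ai-gateway", attrs)
print(entity)
```

Note how all deployments share the same routing tag while keeping distinct service identities.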

Complete Kubernetes Setup Example

The following example shows a complete Envoy AI Gateway deployment configured for SkyWalking observability via OTLP metrics and access logs.

```yaml
# 1. GatewayClass — standard Envoy Gateway controller
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-ai-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
# 2. GatewayConfig — OTLP configuration for SkyWalking
#    One GatewayConfig per gateway. Sets job_name, service name, instance ID,
#    and enables OTLP push for both metrics and access logs.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: GatewayConfig
metadata:
  name: my-gateway-config
  namespace: default
spec:
  extProc:
    kubernetes:
      env:
        # SkyWalking service name = Gateway CRD name (auto-resolved from pod label)
        # OTEL_SERVICE_NAME is the standard OTel env var for service.name
        - name: GATEWAY_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['gateway.envoyproxy.io/owning-gateway-name']
        - name: OTEL_SERVICE_NAME
          value: "$(GATEWAY_NAME)"
        # OTLP endpoint — SkyWalking OAP gRPC receiver
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://skywalking-oap.skywalking:11800"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "grpc"
        - name: OTEL_METRICS_EXPORTER
          value: "otlp"
        - name: OTEL_LOGS_EXPORTER
          value: "otlp"
        # Pod name for instance identity
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # job_name — fixed routing tag for MAL/LAL rules (same for ALL AI Gateway deployments)
        # service.instance.id — SkyWalking instance name (= pod name)
        # service.layer — routes logs to ENVOY_AI_GATEWAY LAL rules
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "job_name=envoy-ai-gateway,service.instance.id=$(POD_NAME),service.layer=ENVOY_AI_GATEWAY"
---
# 3. Gateway — references the GatewayConfig via annotation
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-ai-gateway
  namespace: default
  annotations:
    aigateway.envoyproxy.io/gateway-config: my-gateway-config
spec:
  gatewayClassName: envoy-ai-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
# 4. AIGatewayRoute — routing rules + token metadata for access logs
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: my-ai-gateway-route
  namespace: default
spec:
  parentRefs:
    - name: my-ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  # Enable token counts in access logs
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
  # Route all models to the backend
  rules:
    - backendRefs:
        - name: openai-backend
---
# 5. AIServiceBackend + Backend — LLM provider
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: openai-backend
  namespace: default
spec:
  schema:
    name: OpenAI
  backendRef:
    name: openai-backend
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: openai-backend
  namespace: default
spec:
  endpoints:
    - fqdn:
        hostname: api.openai.com
        port: 443
```

Key env var mapping:

| Env Var / Resource Attribute | SkyWalking Concept | Example Value |
|---|---|---|
| OTEL_SERVICE_NAME | Service name | my-ai-gateway (auto-resolved from Gateway CRD name) |
| job_name (in OTEL_RESOURCE_ATTRIBUTES) | MAL/LAL rule routing | envoy-ai-gateway (fixed for all deployments) |
| service.instance.id (in OTEL_RESOURCE_ATTRIBUTES) | Instance name | envoy-default-my-ai-gateway-... (auto-resolved from pod name) |
| service.layer (in OTEL_RESOURCE_ATTRIBUTES) | LAL log routing | ENVOY_AI_GATEWAY (fixed) |

No manual per-gateway configuration needed for service and instance names:

  • GATEWAY_NAME is auto-resolved from the pod label gateway.envoyproxy.io/owning-gateway-name, which is set automatically by the Envoy Gateway controller on every envoy pod.
  • OTEL_SERVICE_NAME uses $(GATEWAY_NAME) substitution to set the per-deployment service name.
  • POD_NAME is auto-resolved from the pod name via the Downward API.

The GatewayConfig.spec.extProc.kubernetes.env field accepts full corev1.EnvVar objects (including valueFrom), merged into the ext_proc container by the gateway mutator webhook. Verified on Kind cluster — the gateway label resolves correctly (e.g., my-ai-gateway).

Important: The resource.WithFromEnv() code path in the AI Gateway (internal/metrics/metrics.go) is conditional — it only executes when OTEL_EXPORTER_OTLP_ENDPOINT is set (or OTEL_METRICS_EXPORTER=console). The ext_proc runs in-process (not as a subprocess), so there is no env var propagation issue.

3. MAL Rules for OTLP Metrics

Create oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/ with 2 MAL rule files consuming the 4 GenAI metrics from Envoy AI Gateway. Since expSuffix is file-level, service and instance scopes need separate files. Provider and model breakdowns share the same expSuffix as their parent scope, so they are included in the same file.

| File | expSuffix | Contains |
|---|---|---|
| gateway-service.yaml | service(['service_name'], Layer.ENVOY_AI_GATEWAY) | Service aggregates + per-provider breakdown + per-model breakdown |
| gateway-instance.yaml | instance(['service_name'], ['service_instance_id'], Layer.ENVOY_AI_GATEWAY) | Instance aggregates + per-provider breakdown + per-model breakdown |

All MAL rule files use the job_name filter to match only AI Gateway traffic:

```yaml
filter: "{ tags -> tags.job_name == 'envoy-ai-gateway' }"
```

Source Metrics from AI Gateway

| Metric | Type | Labels |
|---|---|---|
| gen_ai_client_token_usage | Histogram (Delta) | gen_ai.token.type (input/output), gen_ai.provider.name, gen_ai.response.model, gen_ai.operation.name |
| gen_ai_server_request_duration | Histogram | gen_ai.provider.name, gen_ai.response.model, gen_ai.operation.name |
| gen_ai_server_time_to_first_token | Histogram | gen_ai.provider.name, gen_ai.response.model, gen_ai.operation.name |
| gen_ai_server_time_per_output_token | Histogram | gen_ai.provider.name, gen_ai.response.model, gen_ai.operation.name |

Proposed SkyWalking Metrics

Gateway-level (Service) metrics:

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Request CPM | count/min | meter_envoy_ai_gw_request_cpm | Requests per minute |
| Request Latency Avg | ms | meter_envoy_ai_gw_request_latency_avg | Average request duration |
| Request Latency Percentile | ms | meter_envoy_ai_gw_request_latency_percentile | P50/P75/P90/P95/P99 request duration |
| Input Tokens Rate | tokens/min | meter_envoy_ai_gw_input_token_rate | Input tokens per minute (total across all models) |
| Output Tokens Rate | tokens/min | meter_envoy_ai_gw_output_token_rate | Output tokens per minute (total across all models) |
| Total Tokens Rate | tokens/min | meter_envoy_ai_gw_total_token_rate | Total tokens per minute |
| TTFT Avg | ms | meter_envoy_ai_gw_ttft_avg | Average time to first token |
| TTFT Percentile | ms | meter_envoy_ai_gw_ttft_percentile | P50/P75/P90/P95/P99 time to first token |
| Time Per Output Token Avg | ms | meter_envoy_ai_gw_tpot_avg | Average inter-token latency |
| Time Per Output Token Percentile | ms | meter_envoy_ai_gw_tpot_percentile | P50/P75/P90/P95/P99 inter-token latency |
| Estimated Cost | cost/min | meter_envoy_ai_gw_estimated_cost | Estimated cost per minute (from token counts × config pricing) |

Per-provider breakdown metrics (service scope):

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Provider Request CPM | count/min | meter_envoy_ai_gw_provider_request_cpm | Requests per minute by provider |
| Provider Token Usage | tokens/min | meter_envoy_ai_gw_provider_token_rate | Token rate by provider and token type |
| Provider Latency Avg | ms | meter_envoy_ai_gw_provider_latency_avg | Average latency by provider |

Per-model breakdown metrics (service scope):

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Model Request CPM | count/min | meter_envoy_ai_gw_model_request_cpm | Requests per minute by model |
| Model Token Usage | tokens/min | meter_envoy_ai_gw_model_token_rate | Token rate by model and token type |
| Model Latency Avg | ms | meter_envoy_ai_gw_model_latency_avg | Average latency by model |
| Model TTFT Avg | ms | meter_envoy_ai_gw_model_ttft_avg | Average TTFT by model |
| Model TPOT Avg | ms | meter_envoy_ai_gw_model_tpot_avg | Average inter-token latency by model |

Instance-level (per-pod) aggregate metrics:

Same metrics as service-level but scoped to individual pods via expSuffix: service([...]).instance([...]).

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Request CPM | count/min | meter_envoy_ai_gw_instance_request_cpm | Requests per minute per pod |
| Request Latency Avg | ms | meter_envoy_ai_gw_instance_request_latency_avg | Average request duration per pod |
| Request Latency Percentile | ms | meter_envoy_ai_gw_instance_request_latency_percentile | P50/P75/P90/P95/P99 per pod |
| Input Tokens Rate | tokens/min | meter_envoy_ai_gw_instance_input_token_rate | Input tokens per minute per pod |
| Output Tokens Rate | tokens/min | meter_envoy_ai_gw_instance_output_token_rate | Output tokens per minute per pod |
| Total Tokens Rate | tokens/min | meter_envoy_ai_gw_instance_total_token_rate | Total tokens per minute per pod |
| TTFT Avg | ms | meter_envoy_ai_gw_instance_ttft_avg | Average TTFT per pod |
| TTFT Percentile | ms | meter_envoy_ai_gw_instance_ttft_percentile | P50/P75/P90/P95/P99 TTFT per pod |
| TPOT Avg | ms | meter_envoy_ai_gw_instance_tpot_avg | Average inter-token latency per pod |
| TPOT Percentile | ms | meter_envoy_ai_gw_instance_tpot_percentile | P50/P75/P90/P95/P99 TPOT per pod |
| Estimated Cost | cost/min | meter_envoy_ai_gw_instance_estimated_cost | Estimated cost per minute per pod |

Per-provider breakdown metrics (instance scope):

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Provider Request CPM | count/min | meter_envoy_ai_gw_instance_provider_request_cpm | Requests per minute by provider per pod |
| Provider Token Usage | tokens/min | meter_envoy_ai_gw_instance_provider_token_rate | Token rate by provider per pod |
| Provider Latency Avg | ms | meter_envoy_ai_gw_instance_provider_latency_avg | Average latency by provider per pod |

Per-model breakdown metrics (instance scope):

| Monitoring Panel | Unit | Metric Name | Description |
|---|---|---|---|
| Model Request CPM | count/min | meter_envoy_ai_gw_instance_model_request_cpm | Requests per minute by model per pod |
| Model Token Usage | tokens/min | meter_envoy_ai_gw_instance_model_token_rate | Token rate by model per pod |
| Model Latency Avg | ms | meter_envoy_ai_gw_instance_model_latency_avg | Average latency by model per pod |
| Model TTFT Avg | ms | meter_envoy_ai_gw_instance_model_ttft_avg | Average TTFT by model per pod |
| Model TPOT Avg | ms | meter_envoy_ai_gw_instance_model_tpot_avg | Average inter-token latency by model per pod |

Cost Estimation

Reuse the same gen-ai-config.yml pricing configuration from PR #13745. The MAL rules will:

  1. Keep total token counts (input + output) per model from gen_ai_client_token_usage.
  2. Look up per-million-token pricing from config.
  3. Compute estimated_cost = input_tokens × input_cost_per_m / 1_000_000 + output_tokens × output_cost_per_m / 1_000_000.
  4. Amplify by 10^6 (same as PR #13745) to avoid floating point precision issues.

No new MAL function is needed — standard arithmetic operations on counters/gauges are sufficient.
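
The arithmetic in steps 2–4 can be sketched in Python. The pricing values and helper names here are hypothetical stand-ins; real pricing comes from gen-ai-config.yml:

```python
# Hypothetical pricing table standing in for gen-ai-config.yml entries:
# per-million-token prices keyed by model name. Values are examples only.
PRICING = {
    "gpt-4o": {"input_cost_per_m": 2.50, "output_cost_per_m": 10.00},
}

AMPLIFIER = 1_000_000  # same 10^6 amplification as PR #13745

def estimated_cost_amplified(model: str, input_tokens: int, output_tokens: int) -> int:
    """Estimated cost for a token batch, amplified by 10^6 and rounded,
    so it can be aggregated as an integer metric without float drift."""
    p = PRICING[model]
    cost = (input_tokens * p["input_cost_per_m"]
            + output_tokens * p["output_cost_per_m"]) / 1_000_000
    return round(cost * AMPLIFIER)

# 1,000 input + 500 output tokens of gpt-4o:
# raw cost = (1000*2.50 + 500*10.00) / 1e6 = 0.0075 USD
print(estimated_cost_amplified("gpt-4o", 1000, 500))  # 7500 (amplified)
```

Dashboards divide the amplified value back down for display, which is why no new MAL function is required.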

Metrics vs Access Logs for Token Cost

Both data sources provide token counts, but serve different cost analysis purposes:

| Aspect | OTLP Metrics (MAL) | Access Logs (LAL) |
|---|---|---|
| Granularity | Aggregated counters — token sums over time windows | Per-request — exact token count for each individual call |
| Cost output | Cost rate (e.g., $X/minute) — good for trends and capacity planning | Cost per request (e.g., this call cost $0.03) — good for attribution and audit |
| Precision | Approximate (counter deltas over scrape intervals) | Exact (individual request values) |
| Use case | Dashboard trends, billing estimates, provider comparison | Detect expensive individual requests, cost anomaly alerting, per-user/per-session attribution |

The metrics path provides aggregated cost trends. The access log path enables per-request cost analysis — for example, alerting on a single request that consumed an unusually large number of tokens (e.g., a runaway prompt). Both paths reuse the same gen-ai-config.yml pricing data.

4. Access Log Collection via OTLP

The AI Gateway natively supports an OTLP access log sink. When OTEL_LOGS_EXPORTER=otlp (or defaulting to OTLP when OTEL_EXPORTER_OTLP_ENDPOINT is set), Envoy pushes structured access logs directly via OTLP gRPC to the same endpoint as metrics. No FluentBit or external log collector is needed.

AI Gateway Configuration

The OTLP log sink shares the same GatewayConfig CRD env vars as metrics (see Section 2). OTEL_LOGS_EXPORTER=otlp and OTEL_EXPORTER_OTLP_ENDPOINT enable the log sink. The OTEL_RESOURCE_ATTRIBUTES (including job_name, service.instance.id, and service.layer) are injected as resource attributes on each OTLP log record, ensuring consistency between metrics and access logs.

Additionally, enable token metadata population in AIGatewayRoute so token counts appear in access logs:

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
spec:
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
```

OTLP Log Record Structure (Verified)

Each access log record is pushed as an OTLP LogRecord with the following structure:

Resource attributes (from OTEL_RESOURCE_ATTRIBUTES + Envoy metadata):

| Attribute | Example | Notes |
|---|---|---|
| job_name | envoy-ai-gateway | From OTEL_RESOURCE_ATTRIBUTES — MAL/LAL routing tag |
| service.instance.id | aigw-pod-7b9f4d8c5 | From OTEL_RESOURCE_ATTRIBUTES — SkyWalking instance name |
| service.name | envoy-ai-gateway | From OTEL_SERVICE_NAME — SkyWalking service name for logs |
| node_name | default-aigw-run-85f8cf28 | Envoy node identifier |
| cluster_name | default/aigw-run | Envoy cluster name |

Log record attributes (per-request, LLM traffic):

| Attribute | Example | Description |
|---|---|---|
| gen_ai.request.model | llama3.2:latest | Original requested model |
| gen_ai.response.model | llama3.2:latest | Actual model from response |
| gen_ai.provider.name | openai | Backend provider name |
| gen_ai.usage.input_tokens | 31 | Input token count |
| gen_ai.usage.output_tokens | 4 | Output token count |
| session.id | sess-abc123 | Session identifier (if set via header mapping) |
| response_code | 200 | HTTP status code |
| duration | 1835 | Request duration (ms) |
| request.path | /v1/chat/completions | API path |
| connection_termination_details | - | Envoy connection termination reason |
| upstream_transport_failure_reason | - | Upstream failure reason |

Note: total_tokens is not a separate field in the OTLP log — it equals input_tokens + output_tokens and can be computed in LAL rules. connection_termination_details and upstream_transport_failure_reason serve as error/timeout indicators (replacing response_flags from the file-based log format).

Log record attributes (per-request, MCP traffic):

| Attribute | Example | Description |
|---|---|---|
| mcp.method.name | tools/call | MCP method name |
| mcp.provider.name | kiwi | MCP provider identifier |
| jsonrpc.request.id | 1 | JSON-RPC request ID |
| mcp.session.id | sess-xyz | MCP session ID |

LAL Rules — Sampling Policy

Create oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml to process the OTLP access logs.

Sampling strategy: Not all access logs need to be stored — only those that indicate abnormal or expensive requests. The LAL rules apply the following sampling policy:

  1. High token cost — persist logs where input_tokens + output_tokens >= threshold (default 10,000).
  2. Error responses — always persist logs with response_code >= 400.
  3. Slow/timeout requests — always persist logs where duration exceeds a configurable timeout threshold, or where connection_termination_details / upstream_transport_failure_reason indicate upstream failures. LLM requests are inherently slow (especially streaming), so timeout sampling is important for diagnosing provider availability issues.

All other access logs are dropped to avoid storage bloat.

Industry token usage reference (from OpenRouter State of AI 2025, 100 trillion token study):

| Use Case | Avg Input Tokens | Avg Output Tokens | Avg Total |
|---|---|---|---|
| Simple chat/Q&A | 500–1,000 | 200–400 | ~1,000 |
| Customer support | 500–3,000 | 300–400 | ~2,500 |
| RAG applications | 3,000–4,000 | 300–500 | ~3,500 |
| Programming/code | 6,000–20,000+ | 400–1,500 | ~10,000+ |
| Overall average (2025) | ~6,000 | ~400 | ~6,400 |

Note: The overall average is heavily skewed by programming workloads. Non-programming use cases (chat, RAG, support) typically fall in the 1,000–3,500 total token range.

Default sampling threshold: 10,000 total tokens (configurable). This is approximately 3× the non-programming median (~3,000), which captures genuinely expensive or abnormal requests without logging every routine call. The threshold is configurable to accommodate different workload profiles:

  • Lower (e.g., 5,000) for chat-heavy deployments where most requests are short.
  • Higher (e.g., 30,000) for code-generation-heavy deployments where large prompts are normal.

The LAL rules would:

  1. Extract AI metadata (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, gen_ai.provider.name) from OTLP log record attributes.
  2. Compute total_tokens = input_tokens + output_tokens.
  3. Associate logs with the gateway service and instance using resource attributes (service.name, service.instance.id) in the ENVOY_AI_GATEWAY layer.
  4. Apply sampling: persist only logs matching at least one of:
    • total_tokens >= 10,000 (configurable threshold)
    • response_code >= 400
    • duration >= timeout_threshold or non-empty upstream_transport_failure_reason
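
The sampling decision above reduces to a single predicate. Here is a Python sketch of the policy for reference; the thresholds and field names follow this section, but the function itself is illustrative, not the LAL rule:

```python
TOKEN_THRESHOLD = 10_000       # configurable total-token threshold
TIMEOUT_THRESHOLD_MS = 60_000  # assumed example timeout; configurable

def should_persist(log: dict) -> bool:
    """Persist a log if it is expensive, an error, or slow/failed upstream.
    Field names match the OTLP log record attributes in Section 4."""
    input_tokens = int(log.get("gen_ai.usage.input_tokens", 0) or 0)
    output_tokens = int(log.get("gen_ai.usage.output_tokens", 0) or 0)
    total_tokens = input_tokens + output_tokens  # no total field in the log

    expensive = total_tokens >= TOKEN_THRESHOLD
    error = int(log.get("response_code", 0) or 0) >= 400
    slow = int(log.get("duration", 0) or 0) >= TIMEOUT_THRESHOLD_MS
    upstream_failed = log.get("upstream_transport_failure_reason", "-") not in ("", "-")
    return expensive or error or slow or upstream_failed

print(should_persist({"gen_ai.usage.input_tokens": 9000,
                      "gen_ai.usage.output_tokens": 2000,
                      "response_code": 200, "duration": 1835,
                      "upstream_transport_failure_reason": "-"}))  # True (11,000 tokens)
```

A routine 35-token, 200-status request fails every branch and is dropped, which is the storage-bloat protection the policy is after.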

5. UI Dashboard

OAP side — Create dashboard JSON templates under oap-server/server-starter/src/main/resources/ui-initialized-templates/envoy_ai_gateway/:

  • envoy-ai-gateway-root.json — Root list view of all AI Gateway services.
  • envoy-ai-gateway-service.json — Service dashboard: Request CPM, latency, token rates, TTFT, TPOT, estimated cost, with provider and model breakdown panels.
  • envoy-ai-gateway-instance.json — Instance (pod) level dashboard: Same aggregate metrics as service dashboard but scoped to a single pod, plus per-provider and per-model breakdown panels for that pod.

UI side — A separate PR in skywalking-booster-ui is needed for i18n menu entries (similar to skywalking-booster-ui#534 for Virtual GenAI). The menu entry should be added under the infrastructure/gateway category.

Imported Dependencies libs and their licenses.

No new dependency. The AI Gateway pushes both metrics and access logs via OTLP to SkyWalking's existing otel-receiver.

Compatibility

  • New layer ENVOY_AI_GATEWAY — no breaking change, additive only.
  • New MAL rules — opt-in via configuration.
  • New LAL rules for OTLP access logs — opt-in via configuration.
  • Reuses existing gen-ai-config.yml for cost estimation (shared with agent-based GenAI from PR #13745).
  • No changes to query protocol or storage structure — uses existing meter and log storage.
  • No external log collector (FluentBit, etc.) required — access logs are pushed via OTLP.

General usage docs

Prerequisites

  • Envoy AI Gateway deployed with the GatewayConfig CRD configured (see Section 2 for the full env var setup including OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_RESOURCE_ATTRIBUTES).

Step 1: Configure Envoy AI Gateway

Apply the GatewayConfig CRD from Section 2 to your AI Gateway deployment. Key env vars:

| Env Var | Value | Purpose |
|---|---|---|
| OTEL_SERVICE_NAME | $(GATEWAY_NAME) | SkyWalking service name (per-deployment, auto-resolved from Gateway CRD name) |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://skywalking-oap:11800 | SkyWalking OAP OTLP receiver |
| OTEL_EXPORTER_OTLP_PROTOCOL | grpc | OTLP transport |
| OTEL_METRICS_EXPORTER | otlp | Enable OTLP metrics push |
| OTEL_LOGS_EXPORTER | otlp | Enable OTLP access log push |
| GATEWAY_NAME | (auto from label) | Auto-resolved from pod label gateway.envoyproxy.io/owning-gateway-name |
| POD_NAME | (auto from Downward API) | Auto-resolved from pod name |
| OTEL_RESOURCE_ATTRIBUTES | job_name=envoy-ai-gateway,service.instance.id=$(POD_NAME),service.layer=ENVOY_AI_GATEWAY | Routing tag (fixed) + instance ID (auto) + layer for LAL routing |

Step 2: Configure SkyWalking OAP

Enable the OTel receiver, MAL rules, and LAL rules in application.yml:

```yaml
receiver-otel:
  selector: ${SW_OTEL_RECEIVER:default}
  default:
    enabledHandlers: ${SW_OTEL_RECEIVER_ENABLED_HANDLERS:"otlp-metrics,otlp-logs"}
    enabledOtelMetricsRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_METRICS_RULES:"envoy-ai-gateway"}

log-analyzer:
  selector: ${SW_LOG_ANALYZER:default}
  default:
    lalFiles: ${SW_LOG_LAL_FILES:"envoy-ai-gateway"}
```

Cost Estimation

Update gen-ai-config.yml with pricing for the models served through the AI Gateway. The same config file is shared with agent-based GenAI monitoring.

Appendix A: OTLP Payload Verification

The following data was verified by capturing raw OTLP payloads from the AI Gateway (envoyproxy/ai-gateway-cli:latest Docker image) via an OTel Collector debug exporter.

Resource Attributes

With OTEL_RESOURCE_ATTRIBUTES=service.instance.id=test-instance-456 and OTEL_SERVICE_NAME=aigw-test-service:

| Attribute | Value | Notes |
|---|---|---|
| service.instance.id | test-instance-456 | Set via OTEL_RESOURCE_ATTRIBUTES — confirmed working |
| service.name | aigw-test-service | Set via OTEL_SERVICE_NAME env var |
| telemetry.sdk.language | go | SDK metadata |
| telemetry.sdk.name | opentelemetry | SDK metadata |
| telemetry.sdk.version | 1.40.0 | SDK metadata |

Not present by default (without explicit env config): service.instance.id, job_name, service.layer, host.name. These must be explicitly set via OTEL_RESOURCE_ATTRIBUTES in the GatewayConfig CRD (see Section 2).

resource.WithFromEnv() (source: internal/metrics/metrics.go:35-94) is called inside a conditional block that requires OTEL_EXPORTER_OTLP_ENDPOINT to be set. When configured, OTEL_RESOURCE_ATTRIBUTES is fully honored.

Metric-Level Attributes (Labels)

All 4 metrics carry:

| Label | Example Value | Notes |
|---|---|---|
| gen_ai.operation.name | chat | Operation type |
| gen_ai.original.model | llama3.2:latest | Original model from request |
| gen_ai.provider.name | openai | Backend provider name. In K8s mode with explicit backend routing, this is the configured backend name. |
| gen_ai.request.model | llama3.2:latest | Requested model |
| gen_ai.response.model | llama3.2:latest | Model from response |
| gen_ai.token.type | input / output / cached_input / cache_creation_input | Only on gen_ai.client.token.usage. No total value — total must be computed. cached_input and cache_creation_input are for Anthropic-style prompt caching. |

Metric Names and Types

| OTLP Metric Name | Type | Unit | Temporality |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram (not Counter!) | token | Delta |
| gen_ai.server.request.duration | Histogram | s (seconds, not ms!) | Delta |
| gen_ai.server.time_to_first_token | Histogram | s | Delta (streaming only) |
| gen_ai.server.time_per_output_token | Histogram | s | Delta (streaming only) |

Key findings:

  1. Token usage is a Histogram, not a Counter — Sum/Count/Min/Max available per bucket.
  2. Duration is in seconds — MAL rules must multiply by 1000 for ms display.
  3. Temporality is Delta — MAL needs increase() semantics, not rate().
  4. TTFT and TPOT only appear for streaming requests — non-streaming produces only token.usage + request.duration.
  5. Dots in metric names — OTLP uses dots (gen_ai.client.token.usage), Prometheus converts to underscores.
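
Findings 2 and 3 imply two small transformations downstream of the OTel receiver. The following Python sketch (data structures simplified; this is not OAP code) shows delta-to-cumulative accumulation plus the seconds-to-milliseconds conversion:

```python
from collections import defaultdict

# Each OTLP delta data point carries (sum in seconds, count) for a
# histogram series. With Delta temporality every point is an increment,
# so cumulative state is just a running sum.
cumulative = defaultdict(lambda: {"sum_s": 0.0, "count": 0})

def accumulate(series_key: str, delta_sum_s: float, delta_count: int) -> None:
    """Fold one delta data point into the cumulative state."""
    state = cumulative[series_key]
    state["sum_s"] += delta_sum_s
    state["count"] += delta_count

def avg_ms(series_key: str) -> float:
    """Average duration in ms: histogram sum / count, seconds -> ms."""
    state = cumulative[series_key]
    return state["sum_s"] * 1000.0 / state["count"] if state["count"] else 0.0

# Two delta points for the same series (first sum taken from Appendix B):
accumulate("gen_ai.server.request.duration|provider=openai", 10.432428, 1)
accumulate("gen_ai.server.request.duration|provider=openai", 2.567572, 1)
print(avg_ms("gen_ai.server.request.duration|provider=openai"))  # ≈ 6500.0 ms
```

The same accumulate-then-derive shape applies to token rates, with increase-per-minute in place of the average.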

Histogram Bucket Boundaries (verified from source: internal/metrics/genai.go)

Token usage (14 boundaries, power-of-4): 1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864

Request duration (14 boundaries, power-of-2 seconds): 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92

TTFT (21 boundaries, finer granularity for streaming): 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0

TPOT (13 boundaries, finest granularity): 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5
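
Percentile panels are derived from these explicit bucket boundaries. Below is a minimal sketch of the standard linear-interpolation estimate over cumulative bucket counts; it illustrates the technique, not the OAP percentile implementation:

```python
# Estimate a percentile from an explicit-bounds histogram by locating the
# bucket containing the target rank and interpolating linearly inside it.

def estimate_percentile(bounds: list[float], bucket_counts: list[int], q: float) -> float:
    """bounds has N entries; bucket_counts has N+1 (last is overflow)."""
    total = sum(bucket_counts)
    target = q * total
    cumulative = 0
    for i, count in enumerate(bucket_counts):
        if cumulative + count >= target and count > 0:
            lower = bounds[i - 1] if i > 0 else 0.0
            upper = bounds[i] if i < len(bounds) else bounds[-1]
            return lower + (upper - lower) * (target - cumulative) / count
        cumulative += count
    return bounds[-1]

# First ten TTFT boundaries from the source above, with all observations
# landing in the (0.25, 0.5] second bucket:
bounds = [0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0]  # 10 samples in (0.25, 0.5]
print(estimate_percentile(bounds, counts, 0.5))  # 0.375, the bucket midpoint
```

The coarse power-of-4 token buckets make token-usage percentiles far less precise than the finely bucketed TTFT/TPOT percentiles, which is worth noting on dashboards.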

Impact on Implementation

| Finding | Impact |
|---|---|
| No service.instance.id by default | OTEL_RESOURCE_ATTRIBUTES=service.instance.id=<value> works when the OTLP exporter is configured (verified). MAL rules should treat instance as optional and document OTEL_RESOURCE_ATTRIBUTES configuration. |
| gen_ai.provider.name = backend name | In K8s mode with explicit backend config, this is the configured backend name. |
| Token usage is Histogram | MAL uses histogram sum/count, not counter value. |
| Delta temporality | SkyWalking OTel receiver must handle delta-to-cumulative conversion. |
| Duration in seconds | MAL rules multiply by 1000 for ms-based metrics. |
| TTFT/TPOT streaming-only | Dashboard should note these metrics may be absent for non-streaming workloads. |

Bonus: Traces Also Pushed

The AI Gateway also pushes OpenInference traces via OTLP, including full request/response content in span attributes (llm.input_messages, llm.output_messages, llm.token_count.*). This is a potential future integration point but out of scope for this SWIP.

Appendix B: Raw OTLP Metric Data (Verified)

Captured from OTel Collector debug exporter. This is the actual OTLP payload from envoyproxy/ai-gateway-cli:latest.

Resource Attributes

```
Resource SchemaURL: https://opentelemetry.io/schemas/1.39.0
Resource attributes:
     -> service.instance.id: Str(test-instance-456)
     -> service.name: Str(aigw-test-service)
     -> telemetry.sdk.language: Str(go)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.40.0)
```

OTEL_RESOURCE_ATTRIBUTES=service.instance.id=<value> is honored when an OTLP exporter is configured (i.e., OTEL_EXPORTER_OTLP_ENDPOINT is set). Without an OTLP endpoint, the resource block is skipped and only the Prometheus reader is used (which does not carry resource attributes per-metric).

InstrumentationScope

```
ScopeMetrics SchemaURL:
InstrumentationScope envoyproxy/ai-gateway
```

Metric 1: gen_ai.client.token.usage (input tokens)

```
Name: gen_ai.client.token.usage
Description: Number of tokens processed.
Unit: token
DataType: Histogram
AggregationTemporality: Delta

Data point attributes:
     -> gen_ai.operation.name: Str(chat)
     -> gen_ai.original.model: Str(llama3.2:latest)
     -> gen_ai.provider.name: Str(openai)
     -> gen_ai.request.model: Str(llama3.2:latest)
     -> gen_ai.response.model: Str(llama3.2:latest)
     -> gen_ai.token.type: Str(input)
Count: 1
Sum: 31.000000
Min: 31.000000
Max: 31.000000
ExplicitBounds: [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864]
```

Metric 1b: gen_ai.client.token.usage (output tokens)

```
Data point attributes:
     -> gen_ai.token.type: Str(output)
     (other attributes same as above)
Count: 1
Sum: 3.000000
```

Metric 2: gen_ai.server.request.duration

```
Name: gen_ai.server.request.duration
Description: Generative AI server request duration such as time-to-last byte or last output token.
Unit: s
DataType: Histogram
AggregationTemporality: Delta

Data point attributes:
     -> gen_ai.operation.name: Str(chat)
     -> gen_ai.original.model: Str(llama3.2:latest)
     -> gen_ai.provider.name: Str(openai)
     -> gen_ai.request.model: Str(llama3.2:latest)
     -> gen_ai.response.model: Str(llama3.2:latest)
Count: 1
Sum: 10.432428
ExplicitBounds: [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92]
```

Metric 3: gen_ai.server.time_to_first_token (streaming only)

```
Name: gen_ai.server.time_to_first_token
Description: Time to receive first token in streaming responses.
Unit: s
DataType: Histogram
AggregationTemporality: Delta
(Same attributes as request.duration, excluding gen_ai.token.type)
ExplicitBounds (from source code): [0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
                                     0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0]
```

Metric 4: gen_ai.server.time_per_output_token (streaming only)

```
Name: gen_ai.server.time_per_output_token
Description: Time per output token generated after the first token for successful responses.
Unit: s
DataType: Histogram
AggregationTemporality: Delta
(Same attributes as request.duration, excluding gen_ai.token.type)
ExplicitBounds (from source code): [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5,
                                     0.75, 1.0, 2.5]
```

Appendix C: Access Log Format (from Envoy Config Dump)

The AI Gateway auto-configures two access log entries on the listener (one for LLM, one for MCP). Verified from config_dump of the AI Gateway.

LLM Access Log Format (JSON)

Filter: request.headers['x-ai-eg-model'] != '' (only logs requests processed by the AI Gateway ext_proc)

```json
{
  "start_time": "%START_TIME%",
  "method": "%REQ(:METHOD)%",
  "request.path": "%REQ(:PATH)%",
  "x-envoy-origin-path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
  "response_code": "%RESPONSE_CODE%",
  "duration": "%DURATION%",
  "bytes_received": "%BYTES_RECEIVED%",
  "bytes_sent": "%BYTES_SENT%",
  "user-agent": "%REQ(USER-AGENT)%",
  "x-request-id": "%REQ(X-REQUEST-ID)%",
  "x-forwarded-for": "%REQ(X-FORWARDED-FOR)%",
  "x-envoy-upstream-service-time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
  "upstream_host": "%UPSTREAM_HOST%",
  "upstream_cluster": "%UPSTREAM_CLUSTER%",
  "upstream_local_address": "%UPSTREAM_LOCAL_ADDRESS%",
  "upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
  "downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
  "downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
  "connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
  "gen_ai.request.model": "%REQ(X-AI-EG-MODEL)%",
  "gen_ai.response.model": "%DYNAMIC_METADATA(io.envoy.ai_gateway:model_name_override)%",
  "gen_ai.provider.name": "%DYNAMIC_METADATA(io.envoy.ai_gateway:backend_name)%",
  "gen_ai.usage.input_tokens": "%DYNAMIC_METADATA(io.envoy.ai_gateway:llm_input_token)%",
  "gen_ai.usage.output_tokens": "%DYNAMIC_METADATA(io.envoy.ai_gateway:llm_output_token)%",
  "session.id": "%DYNAMIC_METADATA(io.envoy.ai_gateway:session.id)%"
}
```

Code review corrections (source: internal/metrics/genai.go, examples/access-log/basic.yaml, site/docs/capabilities/observability/accesslogs.md):

  • response_flags (%RESPONSE_FLAGS%) IS documented in AI Gateway access log docs and used in tests, but not in the default config. Can be added via EnvoyProxy resource if needed.
  • gen_ai.usage.total_tokens IS supported via %DYNAMIC_METADATA(io.envoy.ai_gateway:llm_total_token)% when AIGatewayRoute.spec.llmRequestCosts includes type: TotalToken.
  • Access log format is user-configurable via EnvoyProxy resource, not hardcoded by the AI Gateway. The AI Gateway only populates dynamic metadata; users define which fields appear in logs.
  • Additional token cost types beyond input/output/total: CachedInputToken and CacheCreationInputToken (for Anthropic-style prompt caching, stored as llm_cached_input_token and llm_cache_creation_input_token in dynamic metadata).

MCP Access Log Format (JSON)

Filter: request.headers['x-ai-eg-mcp-backend'] != ''

```json
{
  "start_time": "%START_TIME%",
  "method": "%REQ(:METHOD)%",
  "request.path": "%REQ(:PATH)%",
  "response_code": "%RESPONSE_CODE%",
  "duration": "%DURATION%",
  "mcp.method.name": "%DYNAMIC_METADATA(io.envoy.ai_gateway:mcp_method)%",
  "mcp.provider.name": "%DYNAMIC_METADATA(io.envoy.ai_gateway:mcp_backend)%",
  "mcp.session.id": "%REQ(MCP-SESSION-ID)%",
  "jsonrpc.request.id": "%DYNAMIC_METADATA(io.envoy.ai_gateway:mcp_request_id)%",
  "session.id": "%DYNAMIC_METADATA(io.envoy.ai_gateway:session.id)%"
}
```