Back to Tuist

Grafana Kubernetes Monitoring for the Tuist managed cluster

infra/helm/k8s-monitoring/README.md

4.191.86.2 KB
Original Source

Grafana Kubernetes Monitoring for the Tuist managed cluster

Wraps grafana/k8s-monitoring (v4) so the Tuist-managed workload clusters forward the full Kubernetes telemetry picture to Grafana Cloud. What you get out of the box:

SignalSource
Server app metricsAuto-discovered via prometheus.io/scrape=true annotation on the server pods
Server tracesOTLP gRPC :4317 → Grafana Cloud Tempo
Server logsstdout tailed from /var/log/pods by a per-node Alloy DaemonSet → Grafana Cloud Loki
kube-state-metricsDeployed + scraped (workload / pod / deployment / replica state)
node-exporterDeployed as DaemonSet (node CPU / mem / disk / net)
kubelet + cAdvisorScraped (container resource usage)
Kubernetes EventsStreamed to Loki as structured logs

With these in place the Grafana Cloud Observability → Kubernetes app populates automatically (Cluster / Namespace / Workload / Pod / Node views) without importing dashboards by hand.

Install

Installed automatically by the observability-install job in .github/workflows/server-deployment.yml — it runs before every server deploy and is idempotent, so the chart tracks whatever's committed on main. The first deploy against a new cluster brings it up; subsequent deploys are no-op upgrade checks.

Manual install (only needed when bootstrapping a fresh cluster ahead of the first CI deploy, or iterating locally):

bash
helm dependency update infra/helm/k8s-monitoring
helm upgrade --install k8s-monitoring infra/helm/k8s-monitoring \
  -n observability --create-namespace \
  -f infra/helm/k8s-monitoring/values-staging.yaml

Prerequisites:

  1. ClusterSecretStore onepassword exists. Installed once per workload cluster as part of the Tuist chart bootstrap — see k8s/syself-onboarding.md §5.

  2. 1Password items present in the cluster's vault:

    Item nameCategoryField
    PROMETHEUS_TOKENPasswordpassword
    LOKI_TOKENPasswordpassword
    TEMPO_TOKENPasswordpassword
  3. Grafana Cloud endpoints / usernames — baked into values.yaml. Sanity-check they match the stack before installing a fresh cluster.

  4. Worker nodes sized for the footprint. Four Alloy DaemonSets × 2 workers + kube-state-metrics + node-exporter want ~1.5 GB per node on top of the app. Staging/canary clusters run on cpx31 (8 GB/node), production on ccx23 (16 GB/node). cpx22 (4 GB) is too small — a rolling server update can't fit a fresh pod alongside the old one while the Alloy DaemonSets are pinned to the node.

Server-side wiring

The managed Tuist server pushes OTLP spans to the alloy-receiver Service:

http://k8s-monitoring-alloy-receiver.observability.svc.cluster.local:4317

infra/helm/tuist/values-managed-{staging,canary,production}.yaml set TUIST_OTEL_EXPORTER_OTLP_ENDPOINT to this address.

Server pod metrics are discovered automatically: the server Deployment carries prometheus.io/scrape: "true" and prometheus.io/port: "9091", and annotationAutodiscovery picks those up without any static scrape-target config.

What gets deployed

Four Alloy instances, split by role (managed by the upstream alloy-operator):

  • alloy-metrics — scrapes metrics (cluster / node / app) ; runs clustered so replicas hash-partition targets
  • alloy-logs — DaemonSet tailing pod logs from /var/log/pods
  • alloy-singleton — cluster events (singleton so events aren't duplicated)
  • alloy-receiver — OTLP gRPC receiver for the server's traces

Plus the telemetry services themselves:

  • kube-state-metrics Deployment
  • node-exporter DaemonSet

Local validation

bash
helm dependency update infra/helm/k8s-monitoring
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml
helm template k8s-monitoring infra/helm/k8s-monitoring \
  -n observability \
  -f infra/helm/k8s-monitoring/values-staging.yaml \
  | kubectl apply --dry-run=client -f -

Verify it's working after install

bash
# All four Alloy StatefulSets / DaemonSets ready
kubectl -n observability get alloy,statefulset,daemonset

# Grafana Cloud token secret materialized
kubectl -n observability get externalsecret,secret k8s-monitoring-grafana-cloud

# Alloy-receiver is listening on :4317
kubectl -n observability get svc k8s-monitoring-alloy-receiver

# Cluster metrics flowing (check from inside alloy-metrics pod)
kubectl -n observability port-forward svc/k8s-monitoring-alloy-metrics 12345:12345 &
curl -s http://localhost:12345/metrics | grep 'prometheus_remote_storage_samples_total{'

In Grafana Cloud: Observability → Kubernetes → Cluster navigation and pick the cluster by name (tuist-staging / tuist-canary / tuist-production).

Label conventions (for dashboards / queries)

Label / attributeWhere it's setApplies to
cluster / k8s.cluster.namek8s-monitoring.cluster.name in overlaysmetrics, logs, traces
envdestinations.*.extraLabels in overlaysmetrics, logs (Loki/Prometheus external labels)
deployment.environmentdestinations.grafana-cloud-traces.processors.attributes.actions in overlaystraces (OTLP resource attribute)

Server-level labels (namespace, pod, container, deployment/statefulset names) are attached automatically by the upstream chart's k8s attribute processor from pod metadata.

RBAC — what access does this chart get?

  • alloy-metrics — cluster-wide get/list/watch on nodes/pods/services/endpoints for target discovery, plus /metrics/cadvisor on kubelets.
  • alloy-logs — node-local hostPath to /var/log/pods. A compromised pod can only read logs from the single node it runs on.
  • alloy-singleton — cluster-wide get/list/watch on events.
  • alloy-receiver — none beyond standard pod execution.
  • kube-state-metrics — cluster-wide read on most core/apps/batch objects (standard for KSM).
  • node-exporter — hostPID, /proc / /sys hostPath (standard for node_exporter).

All cluster-wide reads are metadata only. Grafana Cloud tokens remain in the ESO-managed Secret, not mounted as files.