infra/helm/k8s-monitoring/README.md
Wraps grafana/k8s-monitoring (v4) so the Tuist-managed workload clusters forward the full Kubernetes telemetry picture to Grafana Cloud. What you get out of the box:
| Signal | Source |
|---|---|
| Server app metrics | Auto-discovered via prometheus.io/scrape=true annotation on the server pods |
| Server traces | OTLP gRPC :4317 → Grafana Cloud Tempo |
| Server logs | stdout tailed from /var/log/pods by a per-node Alloy DaemonSet → Grafana Cloud Loki |
| kube-state-metrics | Deployed + scraped (workload / pod / deployment / replica state) |
| node-exporter | Deployed as DaemonSet (node CPU / mem / disk / net) |
| kubelet + cAdvisor | Scraped (container resource usage) |
| Kubernetes Events | Streamed to Loki as structured logs |
With these in place the Grafana Cloud Observability → Kubernetes app populates automatically (Cluster / Namespace / Workload / Pod / Node views) without importing dashboards by hand.
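For orientation, the wrapper's values.yaml roughly follows this shape. This is an illustrative sketch only: everything nests under the k8s-monitoring dependency alias (as the label table below confirms), but exact key names beyond those referenced elsewhere in this README belong to the upstream v4 schema and may differ.

```yaml
# Illustrative shape only; consult the pinned upstream chart for exact keys.
k8s-monitoring:                  # upstream chart dependency alias
  cluster:
    name: tuist-staging          # also surfaces as the cluster label
  destinations: []               # Grafana Cloud Prometheus / Loki / Tempo (see values.yaml)
  annotationAutodiscovery:
    enabled: true                # picks up prometheus.io/* pod annotations
  alloy-metrics:
    enabled: true
  alloy-logs:
    enabled: true
  alloy-singleton:
    enabled: true
  alloy-receiver:
    enabled: true
```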
Installed automatically by the observability-install job in .github/workflows/server-deployment.yml — it runs before every server deploy and is idempotent, so the chart tracks whatever's committed on main. The first deploy against a new cluster brings it up; subsequent deploys are no-op upgrade checks.
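The job boils down to the same helm upgrade --install invocation as the manual path below. A condensed sketch, with the caveat that the step layout and input wiring here are illustrative (only the job name and chart path come from this README):

```yaml
# Condensed sketch of the observability-install job; step layout illustrative.
observability-install:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install / upgrade k8s-monitoring (idempotent)
      run: |
        helm dependency update infra/helm/k8s-monitoring
        helm upgrade --install k8s-monitoring infra/helm/k8s-monitoring \
          -n observability --create-namespace \
          -f infra/helm/k8s-monitoring/values-${{ inputs.environment }}.yaml
```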
Manual install (only needed when bootstrapping a fresh cluster ahead of the first CI deploy, or iterating locally):
```sh
helm dependency update infra/helm/k8s-monitoring
helm upgrade --install k8s-monitoring infra/helm/k8s-monitoring \
  -n observability --create-namespace \
  -f infra/helm/k8s-monitoring/values-staging.yaml
```
Prerequisites:
- ClusterSecretStore onepassword exists. Installed once per workload cluster as part of the Tuist chart bootstrap — see k8s/syself-onboarding.md §5.
- 1Password items present in the cluster's vault (consumed via ESO ExternalSecrets; see the sketch after this list):

  | Item name | Category | Field |
  |---|---|---|
  | PROMETHEUS_TOKEN | Password | password |
  | LOKI_TOKEN | Password | password |
  | TEMPO_TOKEN | Password | password |

- Grafana Cloud endpoints / usernames are baked into values.yaml. Sanity-check they match the stack before installing on a fresh cluster.
- Worker nodes sized for the footprint. Four Alloy workloads × 2 workers + kube-state-metrics + node-exporter want ~1.5 GB per node on top of the app. Staging/canary clusters run on cpx31 (8 GB/node), production on ccx23 (16 GB/node). cpx22 (4 GB) is too small — a rolling server update can't fit a fresh pod alongside the old one while the Alloy DaemonSet pods are pinned to the node.
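Under the hood, each 1Password item is pulled through the onepassword ClusterSecretStore by an ExternalSecret. A minimal sketch of that wiring, assuming the target Secret name from the verification commands below; the secretKey names are illustrative, and the Prometheus entry stands in for all three tokens:

```yaml
# Sketch: how a token travels from 1Password into the cluster.
# secretKey names illustrative; the chart's templates define the real ones.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: k8s-monitoring-grafana-cloud
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: onepassword
    kind: ClusterSecretStore
  target:
    name: k8s-monitoring-grafana-cloud   # consumed by the Alloy destinations
  data:
    - secretKey: prometheus-token
      remoteRef:
        key: PROMETHEUS_TOKEN            # 1Password item name
        property: password               # item field
```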
The managed Tuist server pushes OTLP spans to the alloy-receiver Service:
```
http://k8s-monitoring-alloy-receiver.observability.svc.cluster.local:4317
```
infra/helm/tuist/values-managed-{staging,canary,production}.yaml set TUIST_OTEL_EXPORTER_OTLP_ENDPOINT to this address.
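Roughly how that lands in an overlay. The env key layout of the Tuist chart's values is an assumption here and may not match exactly; the variable name and address are the real ones:

```yaml
# infra/helm/tuist/values-managed-staging.yaml (abbreviated sketch;
# the surrounding key layout is assumed, not copied from the chart)
env:
  TUIST_OTEL_EXPORTER_OTLP_ENDPOINT: http://k8s-monitoring-alloy-receiver.observability.svc.cluster.local:4317
```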
Server pod metrics are discovered automatically: the server Deployment carries prometheus.io/scrape: "true" and prometheus.io/port: "9091", and annotationAutodiscovery picks those up without any static scrape-target config.
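Concretely, the pod template carries the two annotations named above (the Deployment skeleton around them is illustrative):

```yaml
# Server Deployment pod template (skeleton illustrative; annotations real)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
```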
Four Alloy instances, split by role (managed by the upstream alloy-operator):
- alloy-metrics — scrapes metrics (cluster / node / app); runs clustered so replicas hash-partition targets
- alloy-logs — DaemonSet tailing pod logs from /var/log/pods
- alloy-singleton — cluster events (singleton so events aren't duplicated)
- alloy-receiver — OTLP gRPC receiver for the server's traces

Plus the telemetry services themselves:
- kube-state-metrics Deployment
- node-exporter DaemonSet

To lint and render the chart locally:

```sh
helm dependency update infra/helm/k8s-monitoring
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml
helm template k8s-monitoring infra/helm/k8s-monitoring \
  -n observability \
  -f infra/helm/k8s-monitoring/values-staging.yaml \
  | kubectl apply --dry-run=client -f -
```
Post-install smoke checks:

```sh
# All four Alloy StatefulSets / DaemonSets ready
kubectl -n observability get alloy,statefulset,daemonset

# Grafana Cloud token secret materialized
kubectl -n observability get externalsecret,secret k8s-monitoring-grafana-cloud

# Alloy-receiver is listening on :4317
kubectl -n observability get svc k8s-monitoring-alloy-receiver

# Cluster metrics flowing (via port-forward to alloy-metrics)
kubectl -n observability port-forward svc/k8s-monitoring-alloy-metrics 12345:12345 &
curl -s http://localhost:12345/metrics | grep 'prometheus_remote_storage_samples_total{'
```
In Grafana Cloud, open Observability → Kubernetes → Cluster navigation and pick the cluster by name (tuist-staging / tuist-canary / tuist-production).
| Label / attribute | Where it's set | Applies to |
|---|---|---|
| cluster / k8s.cluster.name | k8s-monitoring.cluster.name in overlays | metrics, logs, traces |
| env | destinations.*.extraLabels in overlays | metrics, logs (Loki/Prometheus external labels) |
| deployment.environment | destinations.grafana-cloud-traces.processors.attributes.actions in overlays | traces (OTLP resource attribute) |
Server-level labels (namespace, pod, container, deployment/statefulset names) are attached automatically by the upstream chart's k8s attribute processor from pod metadata.
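Put together, an overlay wires those three rows roughly like this. The sketch is assembled from the key paths in the table; the list-vs-map layout of destinations and the grafana-cloud-metrics name follow the upstream schema and are assumptions:

```yaml
# values-staging.yaml (sketch; key paths taken from the table above)
k8s-monitoring:
  cluster:
    name: tuist-staging                    # -> cluster / k8s.cluster.name
  destinations:
    - name: grafana-cloud-metrics          # name illustrative
      extraLabels:
        env: staging                       # -> env external label
    - name: grafana-cloud-traces
      processors:
        attributes:
          actions:
            - key: deployment.environment  # -> OTLP resource attribute
              action: upsert
              value: staging
```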
- alloy-metrics — cluster-wide get/list/watch on nodes / pods / services / endpoints for target discovery, plus /metrics/cadvisor on kubelets.
- alloy-logs — node-local hostPath to /var/log/pods. A compromised pod can only read logs from the single node it runs on.
- alloy-singleton — cluster-wide get/list/watch on events.
- alloy-receiver — none beyond standard pod execution.
- kube-state-metrics — cluster-wide read on most core/apps/batch objects (standard for KSM).
- node-exporter — hostPID, /proc and /sys hostPath mounts (standard for node_exporter).

All cluster-wide reads are metadata only. Grafana Cloud tokens remain in the ESO-managed Secret, not mounted as files.
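For reference, the alloy-metrics grant has roughly this shape. This is a sketch: the chart templates the actual ClusterRole, the name below is hypothetical, and the rules mirror the standard Prometheus-style discovery + kubelet scrape grant rather than being copied from the chart:

```yaml
# Sketch of the alloy-metrics ClusterRole (name hypothetical)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-monitoring-alloy-metrics
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
```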