infra/k8s/onboarding.md
Stand up a new Tuist workload cluster (staging / canary / production / preview, or a production Kura regional cluster) on Hetzner via our self-hosted CAPI management cluster, and deploy the Tuist server or Kura controller to it.
We run a management cluster (a single-node Talos VM in Hetzner project tuist-mgmt) that hosts CAPI v1.13 + caph v1.1. You apply Cluster API CRs against it; caph spins up workload nodes in the workload Hetzner project. The mgmt cluster's manifests live in infra/k8s/mgmt/; workload Cluster CRs (and the shared tuist-hcloud ClusterClass) live in infra/k8s/clusters/ and are auto-applied to the mgmt cluster on push to main by mgmt-cluster-apply.yml.
This doc is the runbook for onboarding a new workload cluster end-to-end. The mgmt cluster itself is bootstrapped by the migration PR's runbook; re-bootstrapping it is documented inline in mgmt/tailscale.yaml.
- Access to the tuist.dev tailnet, with talosctl reachable on the mgmt VM at 100.92.208.109:50000 (see mgmt/tailscale.yaml for tailnet onboarding).
- The kubeconfig: tuist-mgmt item in the tuist-k8s-mgmt vault.
- A Hetzner project tuist-workloads (separate from tuist-mgmt) with API access. Token in 1Password as tuist-workloads.
- The cloudflare-tuist-dns item. Local bootstrap reads it from the Founders vault; production Kura regional deploys also need the same item in tuist-k8s-production so CI and the Kura controller can read it.
- The cloudflare-tuist-dns token must be able to edit DNS for tuist.dev, read tuist.dev zone metadata, manage zone Load Balancers, and manage account-level Load Balancing pools/monitors.
- A per-env 1Password vault (tuist-k8s-staging / tuist-k8s-canary / tuist-k8s-production / tuist-k8s-preview) holding the runtime secrets (MASTER_KEY, TUIST_LICENSE_KEY for preview, Grafana Cloud tokens) and a Service Account token scoped to the vault.

mise use -g kubectl helm clusterctl talosctl
Also install the 1Password CLI (op).
mkdir -p ~/.kube
op document get "kubeconfig: tuist-mgmt" --vault tuist-k8s-mgmt > ~/.kube/tuist-mgmt.yaml
chmod 600 ~/.kube/tuist-mgmt.yaml
export KUBECONFIG=~/.kube/tuist-mgmt.yaml
kubectl get clusters -n org-tuist
The org-tuist namespace is where every Cluster CR + the hetzner Secret live.
Each workload cluster is a Cluster CR in topology mode referencing the tuist-hcloud ClusterClass. Existing per-env files:
- clusters/cluster-staging.yaml
- clusters/cluster-canary.yaml
- clusters/cluster-production.yaml
- clusters/cluster-production-us-east.yaml
- clusters/cluster-production-us-west.yaml
- clusters/cluster-preview.yaml

For a new cluster, copy the closest existing file and adjust metadata.name, replica counts, machine types, and any per-pool labels/taints. Variables exposed by the ClusterClass are documented in clusters/README.md. Run mise run k8s:lint-version-drift to confirm topology.version matches the ClusterClass's KUBERNETES_VERSION before applying.
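As a rough manual cross-check before applying, the version comparison can be sketched in shell. This is not the lint task's implementation: check_topology_version is a hypothetical helper, it assumes the first version: key in the file is topology.version, and the expected Kubernetes version is passed in by hand.

```shell
# Hypothetical helper: compare the version pinned in a Cluster manifest
# against an expected Kubernetes version supplied by the caller.
check_topology_version() {
  file=$1
  expected=$2
  # Take the first "version:" key in the file and strip any YAML quotes.
  # Assumption: that key is spec.topology.version.
  actual=$(awk '$1 == "version:" { print $2; exit }' "$file" | tr -d "\"'")
  if [ "$actual" = "$expected" ]; then
    echo "ok: $file pins $actual"
  else
    echo "drift: $file pins ${actual:-<none>}, expected $expected" >&2
    return 1
  fi
}
```

For example, check_topology_version infra/k8s/clusters/cluster-staging.yaml followed by the expected version prints an ok line or returns non-zero on drift.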
The mgmt-cluster-apply.yml workflow auto-applies anything under infra/k8s/clusters/** on push to main. For an out-of-band apply (e.g. before a PR is merged):
kubectl apply -f infra/k8s/clusters/cluster-<env>.yaml
kubectl -n org-tuist get cluster <name> -w
# Ready=True once control plane is up. ~3–5 min cold start.
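If you would rather block than watch, a minimal sketch using kubectl wait follows. The wait_for_cluster name and the 10m default timeout are illustrative choices, and the sketch assumes the Cluster exposes the Ready condition mentioned above.

```shell
# Block until the Cluster CR reports Ready, or fail after the timeout.
# Run with KUBECONFIG pointing at the mgmt cluster.
wait_for_cluster() {
  name=$1
  timeout=${2:-10m}
  kubectl -n org-tuist wait "cluster/$name" \
    --for=condition=Ready --timeout="$timeout"
}
```

Usage: wait_for_cluster tuist-canary-2, or wait_for_cluster tuist-canary-2 15m for a slower cold start.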
Run the k8s:bootstrap-workload task. It is idempotent and handles every step the workload cluster needs before CI deploys can target it. App-serving clusters get the full path (Cilium, HCCM, hcloud-csi, the hetzner Secret on the workload, the platform chart, ESO + the per-env onepassword ClusterSecretStore, the monitoring chart, the app namespace + the Cloudflare origin TLS Secret, and a final ingress smoke test). Production Kura regional clusters install only the shared platform pieces they need, then upload the workload kubeconfig:
mise run k8s:bootstrap-workload <cluster_name> <env> [kubeconfig_item]
# e.g. mise run k8s:bootstrap-workload tuist-canary-2 canary
# e.g. mise run k8s:bootstrap-workload tuist-kura-us-east production "kubeconfig: kura-us-east-1"
On success the script uploads the freshly-minted workload kubeconfig to the per-env 1Password vault. App clusters use the default kubeconfig: tuist-<env> title. Production Kura regional clusters pass explicit titles matching the product cluster IDs: kubeconfig: kura-us-east-1 and kubeconfig: kura-us-west-1.
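The title convention above can be written down as a small shell function. This is only an illustration of the naming rule, not the bootstrap script's code; kubeconfig_title is a hypothetical name.

```shell
# Resolve the 1Password document title for a workload kubeconfig.
# $1: environment (staging, canary, production, preview)
# $2: optional explicit title, as passed for Kura regional clusters
kubeconfig_title() {
  environment=$1
  explicit=${2:-}
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf 'kubeconfig: tuist-%s\n' "$environment"
  fi
}
```

So kubeconfig_title canary yields the default title, while passing "kubeconfig: kura-us-east-1" as the second argument returns that title unchanged.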
Skip this section for production Kura regional clusters. The production server deploy workflow reads kubeconfig: kura-us-east-1 and kubeconfig: kura-us-west-1 from the production 1Password vault, syncs them into the main production server namespace for runtime CR writes, and deploys the Kura controller to those regional clusters directly.
CI uses a namespace-scoped ServiceAccount with a long-lived token, defined in mgmt/ci-service-account.yaml. Apply it on the workload cluster, mint a kubeconfig, and load it into the GitHub Environment secret:
WL_KUBECONFIG=~/.kube/<cluster_name>.yaml
APP_NS=tuist-<env> # production uses tuist (no suffix); preview uses preview-system
sed "s/__NAMESPACE__/${APP_NS}/g" infra/k8s/mgmt/ci-service-account.yaml \
| KUBECONFIG="$WL_KUBECONFIG" kubectl apply -f -
SERVER=$(KUBECONFIG="$WL_KUBECONFIG" kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
CA=$(KUBECONFIG="$WL_KUBECONFIG" kubectl -n "$APP_NS" get secret github-actions-deployer-token -o jsonpath='{.data.ca\.crt}')
TOKEN=$(KUBECONFIG="$WL_KUBECONFIG" kubectl -n "$APP_NS" get secret github-actions-deployer-token -o jsonpath='{.data.token}' | base64 -d)
cat > /tmp/ci-kubeconfig.yaml <<EOF
apiVersion: v1
kind: Config
clusters:
  - name: <cluster_name>
    cluster:
      server: $SERVER
      certificate-authority-data: $CA
contexts:
  - name: ci
    context:
      cluster: <cluster_name>
      namespace: $APP_NS
      user: github-actions-deployer
users:
  - name: github-actions-deployer
    user:
      token: $TOKEN
current-context: ci
EOF
KUBECONFIG=/tmp/ci-kubeconfig.yaml kubectl -n "$APP_NS" get pods # sanity-check
base64 < /tmp/ci-kubeconfig.yaml | gh secret set KUBECONFIG \
--env server-k8s-<env> --repo tuist/tuist
shred -u /tmp/ci-kubeconfig.yaml
# If the workload cluster's control-plane endpoint changes later (for
# example after a load-balancer recreation), re-run the minting flow
# above and refresh the GitHub Environment secret. The kubeconfig
# embeds `clusters[].cluster.server`, so it does not follow endpoint
# changes automatically.
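The APP_NS rule from the comment in the minting flow (production uses tuist, preview uses preview-system, everything else uses tuist-<env>) can be captured in a helper so it is not retyped by hand. app_namespace is a hypothetical name used only for this sketch.

```shell
# Map an environment name to its app namespace, following the rule
# stated next to APP_NS in the minting flow.
app_namespace() {
  case "$1" in
    production) echo "tuist" ;;
    preview)    echo "preview-system" ;;
    *)          echo "tuist-$1" ;;
  esac
}
```

With this in place, APP_NS=$(app_namespace canary) sets the namespace for the canary environment.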
export KUBECONFIG=~/.kube/<cluster_name>.yaml
helm upgrade --install tuist infra/helm/tuist \
-n tuist-<env> --create-namespace \
-f infra/helm/tuist/values-managed-common.yaml \
-f infra/helm/tuist/values-managed-<env>.yaml \
--set server.image.tag="sha-$(git rev-parse --short=12 HEAD)" \
--atomic --timeout 10m
kubectl -n tuist-<env> rollout status deploy/tuist-tuist-server
curl -v https://<env>.tuist.dev/ready
gh workflow run server-deployment.yml -f environment=<env>
The infra/helm/k8s-monitoring/ chart forwards Kubernetes telemetry to Grafana Cloud. The bootstrap task in §4 installs it; the observability-install job in server-deployment.yml keeps it in sync on every deploy. After it lights up, look for the cluster name in Observability → Kubernetes in Grafana Cloud. Verification steps live in infra/helm/k8s-monitoring/README.md.
Preview environments live on the tuist-preview workload cluster, which runs Postgres / ClickHouse / MinIO embedded alongside the server. Each preview is its own Helm release in its own namespace, with auto-deletion driven by a TTL label and an hourly sweep workflow.
In Cloudflare's tuist.dev zone, create a single A record pointing *.preview.tuist.dev at the preview cluster's ingress LB IP (the bootstrap task prints it at the end of step 12). The platform chart issues the wildcard cert via cert-manager DNS-01 against Cloudflare; ingress-nginx picks it up via the --default-ssl-certificate flag set in the chart values.
In the tuist-k8s-preview vault: TUIST_LICENSE_KEY (Login or Password category, password field). The Service Account Auth Token: tuist-preview-k8s 1P item authorizes ESO to read it.
The preview-deploy / preview-sweep workflows scale the preview MachineDeployment from CI via mgmt/preview-mgmt-rbac.yaml. Apply it against the mgmt cluster:
sed 's/__ORG_NS__/org-tuist/g' infra/k8s/mgmt/preview-mgmt-rbac.yaml \
| KUBECONFIG=~/.kube/tuist-mgmt.yaml kubectl apply -f -
Mint a kubeconfig from the preview-scaler-token Secret (same recipe as §5, swapping the SA name + namespace), base64 it into the KUBECONFIG_MGMT GitHub Environment secret on the server-k8s-preview environment, then drop the replicas: 1 pin on cluster-preview.yaml if you want elastic scale-to-zero.
gh workflow run preview-deploy.yml -f pr_number=1234 -f ttl_hours=24
# or, for a one-off commit:
gh workflow run preview-deploy.yml -f commit_sha=abc1234567890... -f ttl_hours=4
The hourly preview-sweep.yml workflow handles deletion + scale-down.
KUBECONFIG=~/.kube/tuist-mgmt.yaml kubectl -n org-tuist delete cluster <cluster_name>
caph drains + deletes the nodes and releases the Hetzner LB and servers.
Cluster stuck in InfrastructureReady: false
kubectl -n org-tuist describe cluster <name>
kubectl -n org-tuist get hetznercluster,hcloudmachine,machine
Most often a bad hetzner Secret in org-tuist (token typo, missing permission). The Secret must hold hcloud=<token> and hcloud-ssh-key-name=<key> (the SSH key uploaded to the workload Hetzner project).
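To check the Secret's shape without printing the token, one option is to list only its key names. The sketch below assumes KUBECONFIG points at the mgmt cluster; hetzner_secret_missing_keys is a hypothetical helper, and it cannot detect a typo inside the token value itself.

```shell
# Print any required key that is absent from the hetzner Secret.
# Only key names are read; the values are never echoed.
hetzner_secret_missing_keys() {
  present=$(kubectl -n org-tuist get secret hetzner \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}')
  for key in hcloud hcloud-ssh-key-name; do
    printf '%s\n' "$present" | grep -qxF -- "$key" || echo "missing: $key"
  done
}
```

Empty output means both keys exist, so the next suspects are the token value and its permissions.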
HCloudMachine stuck with ServerCreateFailedIrrecoverableError / "unsupported location"
Hetzner per-DC capacity or server-type stock. Pick a different machine type (patch the Cluster CR's relevant variable) and kubectl delete machine the stuck ones so caph reconciles. For account-level limits, check https://console.hetzner.cloud/your-account/limits.
LoadBalancer stuck in <pending>
HCCM needs the load-balancer.hetzner.cloud/location annotation on the Service to pick a DC. The platform chart sets it; verify with kubectl describe svc. CCM logs:
kubectl -n kube-system logs -l app.kubernetes.io/name=hcloud-cloud-controller-manager
** (ArgumentError) argument error from the server pod
MASTER_KEY is wrong or missing, usually because of an ESO sync issue or a 1Password item name mismatch. Check:
kubectl -n tuist-<env> get externalsecret
kubectl -n tuist-<env> describe externalsecret tuist-master-key
Helm upgrade times out on the migration Job
kubectl -n tuist-<env> logs job/tuist-tuist-server-migrate-<revision>
Usually database connectivity — confirm DATABASE_URL decrypts cleanly and the Postgres host is reachable.
When the workload cluster (not the mgmt one) is misbehaving, work against its kubeconfig:
op document get "kubeconfig: tuist-<env>" --vault tuist-k8s-<env> > ~/.kube/tuist-<env>.yaml
chmod 600 ~/.kube/tuist-<env>.yaml
export KUBECONFIG=~/.kube/tuist-<env>.yaml
If 1Password is unavailable, CAPI also keeps a copy on the mgmt cluster:
KUBECONFIG=~/.kube/tuist-mgmt.yaml kubectl -n org-tuist \
get secret tuist-<env>-kubeconfig -o jsonpath='{.data.value}' | base64 -d > ~/.kube/tuist-<env>.yaml
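The two retrieval paths can be combined into one helper that tries 1Password first and falls back to the CAPI Secret. This is a sketch under assumptions: fetch_workload_kubeconfig is a hypothetical name, the vault is passed explicitly rather than guessed, and the mgmt kubeconfig is expected at ~/.kube/tuist-mgmt.yaml.

```shell
# Fetch the workload kubeconfig for an environment.
# $1: environment, $2: 1Password vault holding the kubeconfig document
fetch_workload_kubeconfig() {
  environment=$1
  vault=$2
  out="$HOME/.kube/tuist-${environment}.yaml"
  mkdir -p "$HOME/.kube"
  # Preferred path: the document uploaded by the bootstrap task.
  if ! op document get "kubeconfig: tuist-${environment}" --vault "$vault" > "$out" 2>/dev/null; then
    # Fallback: the copy CAPI keeps on the mgmt cluster.
    KUBECONFIG="$HOME/.kube/tuist-mgmt.yaml" kubectl -n org-tuist \
      get secret "tuist-${environment}-kubeconfig" -o jsonpath='{.data.value}' \
      | base64 -d > "$out"
  fi
  chmod 600 "$out"
}
```

After it returns, export KUBECONFIG to the written file as in the commands above.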
Pods stuck 1/1 Running but Deployment shows 0/N Available
The Node went NotReady and kubelet hasn't confirmed pod state since. Kubernetes' default 300s unreachable toleration gets reset on every brief reconnect, so pods stay pinned for hours. Force them off:
# Strip finalizers off any Terminating pod in the namespace
kubectl -n <ns> get pods -o jsonpath='{range .items[?(@.metadata.deletionTimestamp)]}{.metadata.name}{"\n"}{end}' \
| xargs -I{} kubectl -n <ns> patch pod {} -p '{"metadata":{"finalizers":[]}}' --type=merge
# Force-delete any pod scheduled to an unreachable Node
unreachable=$(kubectl get nodes -o json | jq -r '.items[] | select(.spec.taints // [] | map(.key) | index("node.kubernetes.io/unreachable")) | .metadata.name')
for n in $unreachable; do
  kubectl -n <ns> get pods --field-selector spec.nodeName="$n" -o name | xargs -r kubectl -n <ns> delete --grace-period=0 --force
done
kubectl -n <ns> rollout restart deployment --all
Cloudflare returns 525 (origin TLS handshake fails)
Almost always the ingress LB targeting dead Node IPs. HCCM owns the target list; if it's wedged, kick it:
kubectl -n kube-system rollout restart deployment/hcloud-cloud-controller-manager
# Trigger a fresh reconcile of the ingress Service:
kubectl -n platform annotate svc platform-ingress-nginx-controller "tuist.dev/lb-refresh=$(date -u +%s)" --overwrite
kubectl -n kube-system logs -l app.kubernetes.io/name=hcloud-cloud-controller-manager --tail=80
If HCCM logs unable to parse server id: hcloud://nocloud or no matching server found for node, a stale Node object is starving its reconcile queue:
# Bare-metal Nodes with the bad providerID (fixed at the source, but
# old ones lying around still block HCCM):
kubectl get nodes -o json | jq -r '.items[] | select(.spec.providerID=="hcloud://nocloud") | .metadata.name' \
| xargs -r kubectl delete node
# Cloud-worker ghosts (Node exists, hcloud VM is gone):
kubectl get nodes -o json | jq -r '.items[] | select(.spec.providerID | startswith("hcloud://")) | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name' \
| xargs -r -n1 -I{} sh -c 'kubectl get node {} -o json | jq -e ".spec.providerID | sub(\"hcloud://\";\"\") | tonumber" >/dev/null 2>&1 || echo {}' # validate ID is numeric
# Then `kubectl delete node` the ones whose VM is confirmed gone in
# the Hetzner Cloud console.
Deleting a Node object is safe — CAPI re-creates it on the next reconcile if the underlying VM is still alive.
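To confirm the cleanup worked, one approach is to grep recent HCCM logs for the two messages quoted above. hccm_stale_node_errors is a hypothetical helper and the 200-line tail is an arbitrary window, so older occurrences will not show up.

```shell
# Print recent HCCM log lines that point at a stale Node object.
# Exits non-zero when neither message appears in the sampled window.
hccm_stale_node_errors() {
  kubectl -n kube-system logs \
    -l app.kubernetes.io/name=hcloud-cloud-controller-manager --tail=200 \
    | grep -E 'unable to parse server id|no matching server found for node'
}
```

A typical call is hccm_stale_node_errors || echo "no stale-Node errors in the sampled logs".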
When a chart bump changes postInstallScript (or any
KubeadmConfigTemplate / HetznerBareMetalMachineTemplate field) and
you need it to take effect before natural Node churn, force a
re-install. Work against the mgmt kubeconfig:
export KUBECONFIG=~/.kube/tuist-mgmt.yaml
CLUSTER=staging # or canary / production
# Find the HBM bound to the cluster's HBMM, and snapshot its
# creationTimestamp — that's how we'll know the controller has
# re-created it (the HBMM name stays the same on re-bind).
HBMM=$(kubectl get hetznerbaremetalmachine -n org-tuist \
-l cluster.x-k8s.io/cluster-name=tuist-$CLUSTER \
-o jsonpath='{.items[0].metadata.name}')
HBM=$(kubectl get hetznerbaremetalhost -n org-tuist \
-o jsonpath="{.items[?(@.spec.consumerRef.name=='$HBMM')].metadata.name}")
OLD_TS=$(kubectl get hetznerbaremetalhost -n org-tuist $HBM \
-o jsonpath='{.metadata.creationTimestamp}')
# Delete the HBM (NOT the HBMM): caph fast-rebinds a fresh HBMM to
# an already-provisioned HBM without re-running installimage, so
# only deleting the HBM forces caph to discard the OS state.
# Give caph up to 2 min to finish the delete before touching finalizers.
kubectl delete hetznerbaremetalhost -n org-tuist "$HBM" --timeout=2m || true
# Strip caph's finalizer if the HBMM still references the HBM after
# 2 min — otherwise the HBM lingers and `hetzner-robot-controller`
# can't re-create it cleanly.
kubectl patch hetznerbaremetalhost -n org-tuist $HBM \
-p '{"metadata":{"finalizers":[]}}' --type=merge 2>/dev/null || true
# Wait ~8–15 min for the full cycle:
# (empty) → preparing → registering → image-installing →
# ensure-provisioned → provisioned → kubeadm-joined
# Watch for a fresh creationTimestamp AND HBMM Ready=true:
while sleep 30; do
  # Look up the same HBM by name so other hosts in the namespace are ignored.
  NEW=$(kubectl get hetznerbaremetalhost -n org-tuist "$HBM" \
    -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null)
  READY=$(kubectl get hetznerbaremetalmachine -n org-tuist \
    -l cluster.x-k8s.io/cluster-name=tuist-$CLUSTER \
    -o jsonpath='{.items[0].status.ready}')
  echo "$(date +%H:%M:%S) ts=$NEW ready=$READY"
  [ -n "$NEW" ] && [ "$NEW" != "$OLD_TS" ] && [ "$READY" = "true" ] && break
done
Bare-metal Nodes carry the tuist.dev/runner-tier=bare-metal:NoSchedule
taint, so the only workload on them is idempotent runner Pods and there
is no need to cordon or drain. The autoscaler reconverges replica count
automatically after the new Node joins. Afterward, run a smoke test
(linux-runners-staging-smoke.yml) to confirm the new bootstrap is healthy.