infra/k8s/syself-onboarding.md
Stand up a Syself-managed Kubernetes cluster on Hetzner and deploy the Tuist server to it.
Syself Apalla runs a management cluster as SaaS. You apply Cluster API CRs against that management cluster, and the caph provider spins up nodes in your Hetzner project. You never operate the management cluster — you just talk to it.
Target here is staging. Production follows the same shape with HA tuning bumped.
kubectl call.
Zone.DNS:Edit on tuist.dev.
mise use -g kubectl helm clusterctl
kubectl krew install oidc-login
op).
Download the kubeconfig template from https://syself.com/docs/hetzner/apalla/getting-started/accessing-the-management-cluster. It uses kubectl oidc-login as the auth plugin: the first kubectl invocation opens a browser, you SSO in, and a short-lived token is cached in ~/.kube/cache/oidc-login/.
# Save the kubeconfig you downloaded. The `namespace:` field in the
# context must be set to `org-tuist` (our Syself-managed org namespace).
mkdir -p ~/.kube
cp /path/to/syself-mgmt-kubeconfig.yaml ~/.kube/tuist-syself-mgmt.yaml
chmod 600 ~/.kube/tuist-syself-mgmt.yaml
export KUBECONFIG=~/.kube/tuist-syself-mgmt.yaml
# Triggers the browser-based OIDC flow on first use.
kubectl get clusters
To log out later, delete the cached token:
rm -rf ~/.kube/cache/oidc-login
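A wrong `namespace:` value in the context is one documented cause of a failing first call, so it can be worth checking the file before triggering the browser flow. The following is a minimal sketch, assuming a block-style kubeconfig with a single context; `kubeconfig_namespace` is a hypothetical helper, not part of Syself's tooling.

```shell
# Hypothetical helper: print the first `namespace:` value found in a kubeconfig.
# Assumes block-style YAML with one context and an unquoted namespace value.
kubeconfig_namespace() {
  awk '$1 == "namespace:" { print $2; exit }' "$1"
}

# Example check before the first kubectl call:
#   [ "$(kubeconfig_namespace ~/.kube/tuist-syself-mgmt.yaml)" = "org-tuist" ] \
#     || echo "context namespace is not org-tuist" >&2
```

If the helper prints nothing or a placeholder, edit the context in the kubeconfig before running kubectl get clusters.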
Syself provisions into your Hetzner Cloud account. We currently use cloud VMs only (no Hetzner Robot bare-metal) — see the migration PR's "Why not bare metal" section for the reasoning.
tuist-syself). All three managed clusters share one Hetzner project because Syself's ClusterClass hardcodes the Kubernetes Secret name that holds the API token.
op item create --vault Founders --category='API Credential' \
--title='hetzner-tuist-syself' \
credential='<paste token>'
kubectl debug-free troubleshooting):
ssh-keygen -t ed25519 -f ~/.ssh/tuist-syself -N ''
# Upload the .pub contents via Hetzner Cloud console: Security → SSH Keys → Add.
Syself's caph provider reads this Secret when it provisions nodes. It lives in our org namespace on the management cluster (org-tuist). One Secret, shared across all workload clusters — see the Cluster CR file for context on why.
export KUBECONFIG=~/.kube/tuist-syself-mgmt.yaml
ORG_NS="$(kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}')"
kubectl -n "$ORG_NS" create secret generic hetzner \
--from-literal=hcloud="$(op read 'op://Founders/hetzner-tuist-syself/credential')" \
--from-literal=hcloud-ssh-key-name=tuist-syself
# Upload the SSH public key to Hetzner Cloud via API (the ClusterClass
# attaches it to every VM by name).
TOKEN=$(kubectl -n "$ORG_NS" get secret hetzner -o jsonpath='{.data.hcloud}' | base64 -d)
curl -sX POST https://api.hetzner.cloud/v1/ssh_keys \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "$(jq -n --arg key "$(cat ~/.ssh/tuist-syself.pub)" '{name:"tuist-syself", public_key:$key}')"
Verify:
kubectl -n "$ORG_NS" get secret hetzner
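The plain `get secret` above only confirms that the Secret exists. Since a missing `hcloud-ssh-key-name` key is one of the failure modes listed under troubleshooting, a slightly stricter check can help. This is a sketch under assumptions: `missing_secret_keys` is a hypothetical helper that reads key names (one per line) on stdin, and the commented usage relies on kubectl's `go-template` output to list the keys.

```shell
# Hypothetical helper: reads Secret key names (one per line) on stdin and
# reports every required key (passed as arguments) that is absent.
# Returns 0 when all required keys are present, 1 otherwise.
missing_secret_keys() {
  keys="$(cat)"
  missing=0
  for required in "$@"; do
    if ! printf '%s\n' "$keys" | grep -qx -- "$required"; then
      echo "missing key: $required"
      missing=1
    fi
  done
  return "$missing"
}

# Usage against the management cluster (needs kubectl and $ORG_NS from above):
#   kubectl -n "$ORG_NS" get secret hetzner \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_secret_keys hcloud hcloud-ssh-key-name
```

The helper only checks key names, so it never prints the token itself.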
The per-env Cluster CRs are checked in at infra/k8s/syself/workload-cluster-{staging,canary,production}.yaml. Shape (staging shown; canary mirrors it; production swaps the worker type):
fsn1 (Falkenstein)
cpx22 cloud VMs (HA, tolerates 1 node failure, enables zero-downtime upgrades)
cpx22 (staging/canary) or 2 × ccx23 (production, dedicated vCPU)
topology.class follow the Syself docs example for Kubernetes 1.34
Before applying:
topology.class matches a currently-available ClusterStackRelease:
kubectl get clusterstackreleases
class + version together if a newer release is out.
kubectl apply -f infra/k8s/syself/workload-cluster-staging.yaml
Wait for it to come up — 10–20 minutes for the first boot:
kubectl -n "$ORG_NS" get cluster tuist-staging -w
# Ready=True once all nodes, control plane, and CNI are running.
kubectl -n "$ORG_NS" describe cluster tuist-staging
# Good for diagnosing stuck phases (InfrastructureReady, ControlPlaneReady, …).
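For unattended runs, watching with `-w` can be replaced by a bounded poll. The sketch below is one way to do that, not the documented Syself procedure: `wait_until` and `cluster_ready` are hypothetical helpers, and `cluster_ready` assumes the Cluster object exposes a `Ready` condition in `.status.conditions`.

```shell
# Hypothetical helper: re-run a command every <interval> seconds until it
# succeeds, giving up after roughly <timeout> seconds.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Hypothetical check: succeeds when the named Cluster reports Ready=True.
# Needs kubectl pointed at the management cluster and $ORG_NS set.
cluster_ready() {
  status=$(kubectl -n "$ORG_NS" get cluster "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}

# Example: poll every 30 s for up to 30 min.
#   wait_until 1800 30 cluster_ready tuist-staging
```

The 30-minute ceiling is an arbitrary margin over the 10–20 minute first boot mentioned above.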
Fetch the workload cluster kubeconfig directly from the CAPI-managed Secret (we avoid clusterctl get kubeconfig because the pinned v1.13 CLI is built for v1beta2 management clusters, while Syself's is v1beta1):
kubectl -n "$ORG_NS" get secret tuist-staging-kubeconfig -o jsonpath='{.data.value}' \
| base64 -d > ~/.kube/tuist-staging.yaml
chmod 600 ~/.kube/tuist-staging.yaml
# Switch to the workload cluster.
export KUBECONFIG=~/.kube/tuist-staging.yaml
kubectl get nodes
# Expect 3 control-plane + 2 workers, all Ready.
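The node expectation can also be checked mechanically. A minimal sketch, assuming the default `kubectl get nodes` column order (NAME, STATUS, ROLES, ...) and that workers carry no `control-plane` role; `ready_node_counts` is a hypothetical helper.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints "<ready control-plane nodes> <ready worker nodes>".
# Nodes whose STATUS is anything other than exactly "Ready" are not counted.
ready_node_counts() {
  awk '$2 == "Ready" { if ($3 ~ /control-plane/) cp++; else w++ }
       END { print cp + 0, w + 0 }'
}

# Example (the staging shape described above corresponds to "3 2"):
#   kubectl get nodes --no-headers | ready_node_counts
```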
MASTER_KEY (decrypts priv/secrets/stag.yml.enc baked into the image) is synced from 1Password via external-secrets-operator. The chart's externalSecrets block in values-managed-common.yaml already references a ClusterSecretStore named onepassword. Install ESO + the store once per workload cluster:
# 1) ESO CRDs + controller
helm repo add external-secrets https://charts.external-secrets.io
helm upgrade --install external-secrets external-secrets/external-secrets \
-n external-secrets --create-namespace \
--set installCRDs=true
# 2) Stash the 1Password Service Account token in the cluster. The SA must
# have read access to the tuist-k8s-staging vault which holds MASTER_KEY
# (plus the Grafana Cloud tokens the k8s-monitoring chart consumes).
kubectl create namespace onepassword --dry-run=client -o yaml | kubectl apply -f -
kubectl -n onepassword create secret generic onepassword-sa-token \
--from-literal=token="$(op read 'op://Founders/<1p-item-uuid>/credential')" \
--dry-run=client -o yaml | kubectl apply -f -
# 3) Wire it up as a ClusterSecretStore.
cat <<'YAML' | kubectl apply -f -
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
name: onepassword
spec:
provider:
onepasswordSDK:
vault: tuist-k8s-staging
auth:
serviceAccountSecretRef:
name: onepassword-sa-token
namespace: onepassword
key: token
YAML
# 4) Confirm it went Ready.
kubectl get clustersecretstore onepassword
# NAME READY
# onepassword True
The Tuist chart's ExternalSecret resource will pick MASTER_KEY up automatically when Helm installs in the next section.
GitHub Actions deploys via a namespace-scoped ServiceAccount with a long-lived token (the Syself-documented headless pattern). The manifest is checked in at infra/k8s/syself/ci-service-account.yaml.
export KUBECONFIG=~/.kube/tuist-staging.yaml
kubectl apply -f infra/k8s/syself/ci-service-account.yaml
# Build the kubeconfig CI will use. Embed the CA and token so the file is
# self-contained; no reliance on the caller's context.
SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
CA=$(kubectl -n tuist-staging get secret github-actions-deployer-token -o jsonpath='{.data.ca\.crt}')
TOKEN=$(kubectl -n tuist-staging get secret github-actions-deployer-token -o jsonpath='{.data.token}' | base64 -d)
cat > /tmp/ci-kubeconfig.yaml <<EOF
apiVersion: v1
kind: Config
clusters:
- name: tuist-staging
cluster:
server: $SERVER
certificate-authority-data: $CA
contexts:
- name: ci
context:
cluster: tuist-staging
namespace: tuist-staging
user: github-actions-deployer
users:
- name: github-actions-deployer
user:
token: $TOKEN
current-context: ci
EOF
# Sanity-check.
KUBECONFIG=/tmp/ci-kubeconfig.yaml kubectl -n tuist-staging get pods
# Load into the GitHub Environment secret.
base64 < /tmp/ci-kubeconfig.yaml | gh secret set KUBECONFIG \
--env server-k8s-staging --repo tuist/tuist
shred -u /tmp/ci-kubeconfig.yaml
The token is persistent — revoke by deleting the github-actions-deployer-token Secret, or rotate by recreating it.
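If the token Secret has not been populated yet, the heredoc above silently writes empty `certificate-authority-data:` or `token:` fields. A small pre-upload check can catch that. This is a sketch only; `kubeconfig_is_self_contained` is a hypothetical helper and the pattern matching is deliberately simple (it does not parse YAML).

```shell
# Hypothetical helper: succeed only if the kubeconfig embeds a non-empty CA
# and token and does not point at credential files on the local disk.
kubeconfig_is_self_contained() {
  file=$1
  grep -q 'certificate-authority-data: *[^ ]' "$file" \
    || { echo "no embedded CA"; return 1; }
  grep -q 'token: *[^ ]' "$file" \
    || { echo "no embedded token"; return 1; }
  if grep -Eq '^[[:space:]]*(certificate-authority|client-certificate|client-key|tokenFile):' "$file"; then
    echo "references external files"
    return 1
  fi
}

# Example, before loading the file into the GitHub Environment secret:
#   kubeconfig_is_self_contained /tmp/ci-kubeconfig.yaml
```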
export KUBECONFIG=~/.kube/tuist-staging.yaml
cd infra/helm/tuist
helm upgrade --install tuist . \
-n tuist-staging --create-namespace \
-f values-managed-common.yaml \
-f values-managed-staging.yaml \
--set server.image.tag="sha-$(git rev-parse --short=12 HEAD)" \
--atomic --timeout 10m
Watch the rollout:
kubectl -n tuist-staging rollout status deploy/tuist-tuist-server
kubectl -n tuist-staging logs -l app.kubernetes.io/component=server -f
Once the ingress-nginx LB is up and DNS is pointing at it:
curl -v https://staging.tuist.dev/ready
After the manual smoke test passes:
gh workflow run server-deployment.yml -f environment=staging
The in-cluster infra/helm/k8s-monitoring/ chart forwards the full Kubernetes telemetry picture (cluster / pod / node metrics, events, pod logs, server traces) to Grafana Cloud. The observability-install job in .github/workflows/server-deployment.yml installs it automatically, so no manual helm install is needed: the first server deploy against a new cluster brings it up.
Prerequisite: the ClusterSecretStore from §5 must already exist before the first CI deploy — the chart pulls the three Grafana Cloud tokens (Prometheus, Loki, Tempo) through it.
If you do need to run it by hand (e.g. bringing up a fresh cluster before the first CI deploy, or debugging locally):
helm dependency update infra/helm/k8s-monitoring
helm upgrade --install k8s-monitoring infra/helm/k8s-monitoring \
-n observability --create-namespace \
-f infra/helm/k8s-monitoring/values-staging.yaml
After the chart is live, check Observability → Kubernetes in Grafana Cloud for the cluster named tuist-staging / tuist-canary / tuist-production. Full verification steps live in infra/helm/k8s-monitoring/README.md.
Preview environments live on a dedicated workload cluster (tuist-preview) that runs everything embedded — Postgres, ClickHouse, MinIO — alongside the server. Each preview is its own Helm release in its own namespace, with auto-deletion driven by a TTL label and an hourly sweep workflow. The worker pool scales to 0 when no previews are live.
This section is the one-time bootstrap. Once it's done, deploys are purely a GitHub Actions affair — Actions → Preview Deploy → Run workflow.
tuist-k8s-preview 1Password vault, with item TUIST_LICENSE_KEY (Login or Password category, value in the password field).
Founders as 1Password SA — tuist-k8s-preview.
Zone.DNS:Edit on tuist.dev (you already have one for the other clusters; reuse it).
export KUBECONFIG=~/.kube/tuist-syself-mgmt.yaml
ORG_NS="$(kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}')"
kubectl apply -f infra/k8s/syself/workload-cluster-preview.yaml
kubectl -n "$ORG_NS" get cluster tuist-preview -w
# Ready=True once control plane is up. Note: workers default to replicas: 0,
# so initially you'll see 1 control-plane and 0 workers — that's expected.
Fetch the workload kubeconfig the same way as staging:
kubectl -n "$ORG_NS" get secret tuist-preview-kubeconfig -o jsonpath='{.data.value}' \
| base64 -d > ~/.kube/tuist-preview.yaml
chmod 600 ~/.kube/tuist-preview.yaml
export KUBECONFIG=~/.kube/tuist-preview.yaml
helm upgrade --install external-secrets external-secrets/external-secrets \
-n external-secrets --create-namespace \
--set installCRDs=true
kubectl create namespace onepassword --dry-run=client -o yaml | kubectl apply -f -
kubectl -n onepassword create secret generic onepassword-sa-token \
--from-literal=token="$(op read 'op://Founders/1Password SA — tuist-k8s-preview/password')" \
--dry-run=client -o yaml | kubectl apply -f -
cat <<'YAML' | kubectl apply -f -
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
name: onepassword
spec:
provider:
onepasswordSDK:
vault: tuist-k8s-preview
auth:
serviceAccountSecretRef:
name: onepassword-sa-token
namespace: onepassword
key: token
YAML
kubectl get clustersecretstore onepassword
# Expect READY=True.
In Cloudflare's tuist.dev zone, create a single A record:
*.preview.tuist.dev A <preview cluster ingress IP>
The ingress IP is the LB Hetzner provisions when ingress-nginx is installed (next step). Bring up ingress-nginx first, grab the LB IP from kubectl get svc -n ingress-nginx ingress-nginx-controller, then create the record.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
-n ingress-nginx --create-namespace \
-f infra/k8s/syself/ingress-nginx-values.yaml \
--set controller.extraArgs.default-ssl-certificate=ingress-nginx/preview-tuist-dev-wildcard-tls
kubectl -n ingress-nginx get svc ingress-nginx-controller -w
# Wait for EXTERNAL-IP, then create the *.preview.tuist.dev A record.
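The record can be created in the Cloudflare dashboard, or through the Cloudflare API. The sketch below only builds the request body and guards against passing `<pending>` or an empty value as the IP. `wildcard_a_record_payload` is a hypothetical helper; `ttl: 1` (automatic) and `proxied: false` are assumptions, not settings taken from this repo.

```shell
# Hypothetical helper: print the JSON body for the *.preview.tuist.dev A record.
# Rejects empty input and anything containing characters other than digits
# and dots (for example "<pending>"). It does not fully validate IPv4 syntax.
wildcard_a_record_payload() {
  ip=$1
  case "$ip" in
    ''|*[!0-9.]*)
      echo "not an IPv4 address: $ip" >&2
      return 1
      ;;
  esac
  printf '{"type":"A","name":"*.preview.tuist.dev","content":"%s","ttl":1,"proxied":false}\n' "$ip"
}

# Example (CF_ZONE_ID and CF_API_TOKEN are placeholders you must supply):
#   LB_IP=$(kubectl -n ingress-nginx get svc ingress-nginx-controller \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   curl -sX POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records" \
#     -H "Authorization: Bearer $CF_API_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(wildcard_a_record_payload "$LB_IP")"
```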
Issue the wildcard cert via cert-manager DNS-01 (re-uses the same Cloudflare token pattern as the other clusters):
helm repo add jetstack https://charts.jetstack.io
helm upgrade --install cert-manager jetstack/cert-manager \
-n cert-manager --create-namespace \
--set installCRDs=true
# Cloudflare token Secret (same item the other clusters reuse).
kubectl -n cert-manager create secret generic cloudflare-api-token \
--from-literal=api-token="$(op read 'op://Founders/cloudflare-tuist-dev-dns/credential')"
cat <<'YAML' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-cloudflare
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: [email protected]
privateKeySecretRef:
name: letsencrypt-cloudflare-account
solvers:
- dns01:
cloudflare:
apiTokenSecretRef:
name: cloudflare-api-token
namespace: cert-manager
key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: preview-tuist-dev-wildcard
namespace: ingress-nginx
spec:
secretName: preview-tuist-dev-wildcard-tls
issuerRef:
name: letsencrypt-cloudflare
kind: ClusterIssuer
commonName: "*.preview.tuist.dev"
dnsNames:
- "*.preview.tuist.dev"
- "preview.tuist.dev"
YAML
# Wait for issuance (DNS-01 is ~1–3 min).
kubectl -n ingress-nginx get certificate preview-tuist-dev-wildcard -w
The --default-ssl-certificate=ingress-nginx/preview-tuist-dev-wildcard-tls flag we passed to ingress-nginx makes this single Secret cover every preview Ingress automatically — no per-namespace TLS plumbing.
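One caveat worth keeping in mind: a wildcard name matches exactly one DNS label, so the certificate covers hosts directly under preview.tuist.dev (plus preview.tuist.dev itself, which is listed in dnsNames) but not deeper subdomains. The sketch below illustrates that rule; `covered_by_wildcard` is a hypothetical helper and the hostnames in the examples are made up.

```shell
# Hypothetical helper: succeed when a hostname is covered by a certificate
# issued for "*.preview.tuist.dev" and "preview.tuist.dev".
covered_by_wildcard() {
  host=$1
  case "$host" in
    preview.tuist.dev)
      return 0
      ;;
    *.preview.tuist.dev)
      # Strip the base domain; what remains must be a single non-empty label.
      label=${host%.preview.tuist.dev}
      case "$label" in
        ''|*.*) return 1 ;;
        *) return 0 ;;
      esac
      ;;
    *)
      return 1
      ;;
  esac
}

# Examples (hypothetical hostnames):
#   covered_by_wildcard pr-1234.preview.tuist.dev      # covered
#   covered_by_wildcard api.pr-1234.preview.tuist.dev  # not covered (two labels)
```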
export KUBECONFIG=~/.kube/tuist-preview.yaml
sed 's/__NAMESPACE__/preview-system/g' infra/k8s/syself/ci-service-account.yaml \
| kubectl apply -f -
SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
CA=$(kubectl -n preview-system get secret github-actions-deployer-token -o jsonpath='{.data.ca\.crt}')
TOKEN=$(kubectl -n preview-system get secret github-actions-deployer-token -o jsonpath='{.data.token}' | base64 -d)
cat > /tmp/preview-kubeconfig.yaml <<EOF
apiVersion: v1
kind: Config
clusters:
- name: tuist-preview
cluster:
server: $SERVER
certificate-authority-data: $CA
contexts:
- name: ci
context: { cluster: tuist-preview, user: github-actions-deployer }
users:
- name: github-actions-deployer
user: { token: $TOKEN }
current-context: ci
EOF
base64 < /tmp/preview-kubeconfig.yaml | gh secret set KUBECONFIG \
--env server-k8s-preview --repo tuist/tuist
shred -u /tmp/preview-kubeconfig.yaml
The original design had CI scale the preview MachineDeployment to and from 0 through a narrowly scoped ServiceAccount on the management cluster. That requires permission to create serviceaccounts, roles, and rolebindings in the org-tuist namespace, which Syself's OIDC users don't have by default. Until Syself grants it (open a support ticket if you want to revisit), the preview cluster keeps a fixed replicas: 1 worker pool. See the "Why pinned replicas: 1" comment at the top of infra/k8s/syself/workload-cluster-preview.yaml.
The unused preview-mgmt-rbac.yaml manifest stays checked in — once the perms are granted, applying it + setting KUBECONFIG_MGMT is enough to re-enable elastic scaling.
gh workflow run preview-deploy.yml \
-f pr_number=1234 \
-f ttl_hours=24
Or for a one-off commit:
gh workflow run preview-deploy.yml \
-f commit_sha=abc1234567890... \
-f ttl_hours=4
The workflow scales the worker pool up if needed (~3–5 min cold start), labels/taints the new node, runs helm upgrade --install, and posts the URL back to the PR. The hourly preview-sweep.yml workflow handles deletion + scale-down.
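The TTL arithmetic behind the sweep is simple enough to sketch. The real logic lives in preview-sweep.yml and is not reproduced here; `preview_expired` is a hypothetical helper that assumes the creation time is available as epoch seconds and the TTL as whole hours, matching the `ttl_hours` input.

```shell
# Hypothetical helper: succeed when a preview created at $1 (epoch seconds)
# with a TTL of $2 hours has expired as of $3 (epoch seconds, default: now).
preview_expired() {
  created=$1
  ttl_hours=$2
  now=${3:-$(date +%s)}
  [ "$now" -ge $((created + ttl_hours * 3600)) ]
}

# Example: a preview created at epoch 1700000000 with ttl_hours=24.
#   preview_expired 1700000000 24 && echo "eligible for deletion"
```

Because the sweep runs hourly, a preview can outlive its TTL by up to about an hour before it is removed.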
Workload cluster:
export KUBECONFIG=~/.kube/tuist-syself-mgmt.yaml
ORG_NS="$(kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}')"
kubectl -n "$ORG_NS" delete cluster tuist-staging
Syself will drain + delete the nodes and release the Hetzner resources. The Hetzner Cloud project, LB, and DNS records are yours to clean up separately if you're done with the provider.
kubectl get clusters opens the browser but fails afterwards
Check the namespace: in the kubeconfig context — Syself's template ships with a placeholder that must be overwritten with org-tuist.
Cluster stuck in InfrastructureReady: false
kubectl -n "$ORG_NS" describe cluster tuist-staging
kubectl -n "$ORG_NS" get hetznercluster,hcloudmachine,machine
Most often a bad hetzner Secret (API token typo, missing permission, or missing hcloud-ssh-key-name).
HCloudMachine stuck with ServerCreateFailedIrrecoverableError / "unsupported location"
Hetzner's per-DC capacity or server-type stock varies. Two common causes:
kubectl patch cluster ... controlPlaneMachineTypeHcloud or workerMachineTypeHcloud) and kubectl delete machine the stuck ones so CAPI reconciles.
https://console.hetzner.cloud/your-account/limits and request an increase via support.
LoadBalancer stuck in <pending>
Hetzner's cloud-controller-manager needs the load-balancer.hetzner.cloud/location annotation on the Service to pick a DC. Verify kubectl describe svc <name> includes it. CCM logs:
kubectl -n kube-system logs -l app=hcloud-cloud-controller-manager
** (ArgumentError) argument error from the server pod
MASTER_KEY is wrong or missing, usually because of an ESO sync issue or a 1Password item name mismatch. Check:
kubectl -n tuist-staging get externalsecret
kubectl -n tuist-staging describe externalsecret tuist-master-key
Helm upgrade times out on the migration Job
kubectl -n tuist-staging logs job/tuist-tuist-server-migrate-<revision>
Usually Supabase TLS — check TUIST_USE_SSL_FOR_DATABASE=1 is set and the ssl: verify_none workaround is in runtime.exs.