infra/docs/eks-operations-guide.md
Everything you need to know about how our Kubernetes setup works, how to monitor and scale it, and what happens when things go wrong.
Production runs on three targets simultaneously:
User request
→ Cloudflare DNS (api-eks.kortix.com)
→ AWS Application Load Balancer (ALB)
→ EKS cluster (suna-eks) ← this is the primary
→ Your app pods
We also have:
→ Lightsail instance ← legacy, still running
→ ECS cluster (suna-ecs) ← legacy, still running
The EKS cluster is the main production target. Lightsail and ECS are still deployed to, but EKS handles the real traffic through api-eks.kortix.com.
Region: us-west-2 (Oregon)
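For a quick end-to-end check of that path (DNS → ALB → EKS pods), you can hit the health endpoint from anywhere. A sketch, assuming /v1/health-docker is reachable through the internet-facing ALB:
# 200 means DNS, the ALB, and at least one healthy pod are all answering
curl -s -o /dev/null -w '%{http_code}\n' https://api-eks.kortix.com/v1/health-docker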
If you're new to Kubernetes, here's the mental model: the cluster layout, from top to bottom.
Cluster: suna-eks (EKS v1.31)
├── Node Group: suna-api-nodes
│ ├── Instance type: c7i.2xlarge (8 vCPU, 16 GB RAM)
│ ├── Min nodes: 2
│ ├── Max nodes: 8
│ └── Desired: 3
│
├── Namespace: suna
│ ├── Deployment: suna-api (our backend app)
│ │ ├── Base replicas: 4 pods
│ │ ├── Each pod: 500m-1500m CPU, 2Gi-3Gi memory
│ │ └── Each pod runs 2 Gunicorn workers
│ │
│ ├── Service: suna-api (ClusterIP, routes traffic to pods)
│ ├── HPA: suna-api (autoscales 4-15 pods based on CPU)
│ ├── PDB: suna-api (keeps at least 50% of pods alive during disruptions)
│ ├── Ingress: suna-api (ALB → internet-facing)
│ └── Secret: suna-env (all the env vars from Secrets Manager)
│
├── Namespace: kube-system
│ ├── AWS Load Balancer Controller (manages the ALB)
│ ├── Cluster Autoscaler (adds/removes nodes)
│ ├── CloudWatch Observability addon (sends metrics to CloudWatch)
│ └── Better Stack Collector (sends logs/metrics to Better Stack)
│
└── Namespace: default
└── Better Stack Collector DaemonSet
Every pod gets its environment variables from the suna-env secret and runs with these settings:
| Setting | Value |
|---|---|
| CPU request | 500m (half a core) — guaranteed minimum |
| CPU limit | 1500m (1.5 cores) — hard ceiling |
| Memory request | 2Gi — guaranteed minimum |
| Memory limit | 3Gi — if it uses more, K8s kills it (OOMKill) |
| Workers | 2 Gunicorn workers per pod |
| Port | 8000 |
| Health check | /v1/health-docker |
What "request" vs "limit" means: the request is what the scheduler reserves for the pod (the guaranteed minimum); the limit is the hard ceiling. A pod that exceeds its CPU limit gets throttled; a pod that exceeds its memory limit gets OOMKilled.
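To see the requests and limits actually applied on the live Deployment (names as in the layout above):
# Print the CPU/memory requests and limits from the running Deployment spec
kubectl get deployment suna-api -n suna \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'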
All the K8s infrastructure is defined in code using Pulumi (TypeScript). You'll find it in infra/:
infra/
├── environments/
│ ├── prod/index.ts ← production stack (EKS + monitoring)
│ ├── dev/index.ts ← dev (Lightsail only)
│ └── staging/index.ts ← staging (Lightsail only)
│
├── modules/
│ ├── kubernetes/
│ │ ├── cluster.ts ← EKS cluster + node groups
│ │ ├── workload.ts ← Deployment, Service, HPA, PDB, Ingress
│ │ ├── autoscaler.ts ← Cluster Autoscaler (Helm chart)
│ │ ├── iam.ts ← IAM roles for nodes, ALB controller, autoscaler
│ │ └── types.ts ← TypeScript interfaces
│ │
│ ├── monitoring/
│ │ ├── alarms.ts ← CloudWatch alarms + dashboard
│ │ └── types.ts
│ │
│ └── lightsail/
│ ├── instance.ts ← Lightsail for dev/staging
│ └── types.ts
If you want to change the number of pods, node types, memory limits, etc., edit the config values in environments/prod/index.ts and run pulumi up.
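A typical change looks roughly like this. A sketch, assuming your Pulumi stack is already selected and AWS credentials are configured:
cd infra/environments/prod
# Edit the config values (replica counts, node sizes, limits) in index.ts, then:
pulumi preview   # shows exactly what would change before touching anything
pulumi up        # applies the change (updates the AWS and K8s resources)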
Defined in .github/workflows/docker-build.yml. When you push to the PRODUCTION branch:
Push to PRODUCTION
→ Build Docker image
→ Push to GHCR with tags `:prod` and `:<commit-sha>`
→ Deploy to Lightsail (SSH + docker compose) } These three
→ Deploy to ECS (aws ecs update-service) } run in
→ Deploy to EKS (kubectl set image) } parallel
The EKS deploy step:
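It boils down to pointing the Deployment at the freshly pushed image and waiting for the rollout to finish. A sketch of the idea (the exact commands, image path, container name, and timeout live in docker-build.yml):
# Point the Deployment at the new image (container name assumed to be suna-api)
kubectl set image deployment/suna-api suna-api=ghcr.io/<org>/<repo>:<commit-sha> -n suna
# Block until the rolling update finishes; the workflow reports failure if it doesn't
kubectl rollout status deployment/suna-api -n suna --timeout=5m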
When a new image is deployed, K8s doesn't kill all pods and start new ones. It does a rolling update:
Step 1: Start 1 new pod with the new image
Step 2: Wait until the new pod passes health checks
Step 3: Remove 1 old pod
Step 4: Repeat until all pods are running the new image
Our config:
maxUnavailable: 0 — never kill an old pod before a new one is ready (zero downtime)
maxSurge: 1 — only create 1 extra pod at a time (don't overwhelm the nodes)
So if you have 4 pods, during deployment you'll briefly have 5 (4 old + 1 new), then 4 (3 old + 1 new), etc.
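You can confirm the strategy on the live Deployment:
# Shows the maxSurge / maxUnavailable values currently in effect
kubectl get deployment suna-api -n suna -o jsonpath='{.spec.strategy.rollingUpdate}'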
When a pod is being removed, K8s sends it SIGTERM and gives it up to 120 seconds (terminationGracePeriodSeconds) to finish in-flight requests before force-killing it. This means long-running agent executions get up to 2 minutes to complete during deployments.
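The grace period sits on the pod spec, so you can check it directly:
# Should print 120 (seconds), matching the behavior described above
kubectl get deployment suna-api -n suna \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'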
We use topologySpreadConstraints to spread pods across nodes evenly. If you have 4 pods and 2 nodes, each node gets 2 pods. This way, if one node dies, you don't lose all your pods.
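A quick way to see how evenly the pods are spread right now:
# Count suna-api pods per node; the spread constraints should keep these roughly equal
kubectl get pods -n suna -l app.kubernetes.io/name=suna-api \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c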
There are two levels of autoscaling, and they work together:
The Horizontal Pod Autoscaler watches CPU and memory usage across your pods and adds/removes pods to keep utilization around the target.
| Setting | Value |
|---|---|
| Min pods | 4 |
| Max pods | 15 |
| CPU target | 70% average utilization |
| Memory target | 80% average utilization |
| Scale up speed | Can double pods every 60 seconds |
| Scale down speed | Removes at most 25% of pods every 60 seconds |
| Scale down cooldown | Waits 5 minutes before scaling down (prevents flapping) |
| Scale up cooldown | Only waits 30 seconds before scaling up (fast reaction) |
HPA will scale up if either CPU or memory exceeds its target. So if CPU is at 40% but memory hits 85%, HPA will still add pods.
Example: If your 4 pods are averaging 85% CPU (or 85% memory), HPA will add more pods. If both drop below target, HPA will (after 5 minutes) remove pods back to the minimum of 4.
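To watch this happening in real time:
# Live view of current vs target utilization and the replica count
kubectl get hpa suna-api -n suna --watch
# The Events section explains the last scale-up/scale-down decision
kubectl describe hpa suna-api -n suna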
If HPA wants to add pods but there's no room on existing nodes, the Cluster Autoscaler adds new EC2 nodes.
| Setting | Value |
|---|---|
| Min nodes | 2 |
| Max nodes | 8 |
| Scale down threshold | Node is underutilized if < 65% used |
| Scale down delay | Waits 5 minutes after adding a node before considering removal |
| Scale down wait | Node must be underutilized for 5 minutes straight |
| Strategy | "least-waste" — picks the node size that wastes the least resources |
How they work together:
Traffic spikes
→ Pods CPU goes above 70%
→ HPA: "I need more pods!"
→ HPA creates new pods
→ If nodes are full, new pods are "Pending" (can't be scheduled)
→ Cluster Autoscaler sees Pending pods
→ Cluster Autoscaler: "I'll add a node!"
→ New EC2 node joins the cluster (~3-5 minutes)
→ Pending pods get scheduled on the new node
Traffic drops
→ Pods CPU drops below target
→ HPA: "Too many pods, removing some" (after 5 min cooldown)
→ Pods get removed
→ Some nodes become underutilized (< 65% used)
→ Cluster Autoscaler: "This node is mostly empty, removing it" (after 5 min)
→ Pods on that node get moved to other nodes first
→ Empty node is terminated
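During a scale event, you can follow both halves of this from the CLI. A sketch, assuming the Helm chart's default status ConfigMap is enabled:
# Pending pods are what trigger a node scale-up
kubectl get pods -n suna --field-selector=status.phase=Pending
# The Cluster Autoscaler writes its current view (node groups, scale activity) here
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml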
During voluntary disruptions (node drain, cluster upgrade, autoscaler removing a node), the PDB guarantees at least 50% of pods stay running. So if you have 4 pods, at least 2 must be alive during any maintenance operation.
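You can check what the PDB currently allows:
# ALLOWED DISRUPTIONS shows how many pods may be taken down right now
kubectl get pdb -n suna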
Instead of scheduled pod restarts (which would kill in-progress agent runs), we rely on K8s native mechanisms:
K8s constantly checks if your app is healthy using three probes (startup, readiness, and liveness). All three hit the same endpoint (/v1/health-docker on port 8000) but serve different purposes.
Startup probe:
| Setting | Value |
|---|---|
| Initial delay | 10 seconds |
| Check interval | Every 10 seconds |
| Max failures | 12 |
| Timeout per check | 5 seconds |
This gives the app up to 130 seconds (10s delay + 12 x 10s) to start. During this time, liveness and readiness probes are paused. This is important because our app takes a while to load models, connect to databases, etc.
Readiness probe:
| Setting | Value |
|---|---|
| Initial delay | 15 seconds |
| Check interval | Every 10 seconds |
| Max failures | 3 |
| Timeout per check | 5 seconds |
If this fails 3 times in a row, K8s stops sending traffic to this pod (removes it from the load balancer). The pod stays alive — it just doesn't get requests. Once it starts passing again, traffic resumes.
Think of it as: "This pod is alive but busy/unhealthy, give it a break."
Liveness probe:
| Setting | Value |
|---|---|
| Initial delay | 30 seconds |
| Check interval | Every 30 seconds |
| Max failures | 3 |
| Timeout per check | 5 seconds |
If this fails 3 times in a row, K8s kills and restarts the pod. This catches deadlocks, stuck event loops, memory corruption — cases where the process is technically running but completely broken.
Think of it as: "This pod is dead inside, restart it."
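To see the exact probe settings on the live Deployment (the same values as the tables above):
# Dump the startup, readiness, and liveness probes as configured
kubectl get deployment suna-api -n suna -o jsonpath='{.spec.template.spec.containers[0].startupProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}{.spec.template.spec.containers[0].livenessProbe}{"\n"}'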
Pod crashes
→ K8s immediately starts a new one (restartPolicy: Always)
→ Other pods keep serving traffic (no downtime)
→ If it crashes repeatedly, K8s uses "backoff" — waits longer between restarts
(10s, 20s, 40s, 80s, ... up to 5 minutes)
→ This is called CrashLoopBackOff in pod status
You lose nothing. K8s handles this automatically. The other pods keep running.
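To see whether a pod is crash-looping and why:
# The RESTARTS column counts the crashes; STATUS shows CrashLoopBackOff
kubectl get pods -n suna
# "Last State" shows the previous container's exit code and reason
kubectl describe pod <pod-name> -n suna | grep -A 5 'Last State'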
Pod memory exceeds 3Gi limit
→ Linux kernel kills the process (OOMKilled)
→ K8s sees the container exited with code 137
→ K8s restarts the pod on the same node
→ If it keeps OOMing, it goes into CrashLoopBackOff
If you see this happening a lot, you need to either fix the memory leak or increase the memory limit.
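A quick check for whether the last restart was an OOMKill:
# Prints "OOMKilled" if the previous container exceeded its memory limit
kubectl get pod <pod-name> -n suna \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'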
Node goes down
→ K8s notices within ~40 seconds (node heartbeat stops)
→ K8s marks all pods on that node as "Unknown"
→ After ~5 minutes, K8s evicts those pods
→ K8s schedules replacement pods on the remaining healthy nodes
→ If remaining nodes don't have enough room, Cluster Autoscaler adds a new node
→ Your PDB guarantees at least 50% of pods were on OTHER nodes already
Because we spread pods across nodes (topologySpreadConstraints), losing one node typically means losing only 1-2 pods out of 4+. The other pods keep serving traffic.
App stops responding to /v1/health-docker
→ Readiness probe fails (3 times × 10s = 30 seconds)
→ K8s removes pod from load balancer (no new traffic)
→ Liveness probe fails (3 times × 30s = 90 seconds)
→ K8s kills and restarts the pod
→ Pod comes back up, passes probes, rejoins load balancer
Total time from "app stuck" to "pod restarted" is roughly 2 minutes.
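Probe failures leave a trail you can inspect afterwards:
# Failed readiness/liveness checks show up as "Unhealthy" events on the pod
kubectl get events -n suna --field-selector reason=Unhealthy --sort-by='.lastTimestamp'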
New image deployed via rolling update
→ K8s starts 1 new pod with new image
→ New pod fails startup probe (crashes, health check fails, etc.)
→ K8s does NOT remove any old pods (maxUnavailable: 0)
→ Rollout is stuck — old pods keep serving traffic
→ After 5 minutes, the CI/CD timeout reports failure
The old pods keep running. Your users don't notice anything. To fix it:
# See what's wrong
kubectl describe pod -n suna -l app.kubernetes.io/name=suna-api
# Roll back to the previous version
kubectl rollout undo deployment/suna-api -n suna
# Verify
kubectl rollout status deployment/suna-api -n suna
This is extremely unlikely (would require all nodes in multiple availability zones to fail simultaneously), but if it happens:
All pods down
→ ALB health checks fail
→ ALB returns 503 to all requests
→ CloudWatch alarm fires (pod count < 1)
→ Email alert sent via SNS
→ Cluster Autoscaler adds new nodes
→ K8s reschedules pods on new nodes
→ ALB health checks pass, traffic resumes
Manual recovery if needed:
# Check cluster status
kubectl get nodes
kubectl get pods -n suna
# Force restart all pods
kubectl rollout restart deployment/suna-api -n suna
# If nodes are gone, check AWS console for the managed node group
# Or scale it manually:
aws eks update-nodegroup-config \
--cluster-name suna-eks \
--nodegroup-name suna-api-nodes \
--scaling-config minSize=2,maxSize=8,desiredSize=3
We have three monitoring layers:
Dashboard: Go to AWS Console → CloudWatch → Dashboards → suna-api-prod
Shows the same metrics the alarms watch: pod CPU and memory utilization, running pod count, and node count.
Alarms (sends email alerts):
| Alarm | Threshold | Severity |
|---|---|---|
| CPU Warning | > 70% for 10 minutes | Warning |
| CPU Critical | > 85% for 2 minutes | Critical |
| Memory Warning | > 75% for 10 minutes | Warning |
| Memory Critical | > 90% for 2 minutes | Critical |
| Pod Count Low | < 1 running pod for 2 minutes | Critical |
| Node Count Low | < 2 nodes for 2 minutes | Critical |
Better Stack collects logs and metrics from all containers in the cluster using an eBPF-based collector (DaemonSet — runs on every node).
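To confirm the collector is actually running on every node (the exact DaemonSet name depends on the Better Stack install; it lives in the namespaces shown in the layout above):
# DESIRED should equal READY and match the node count
kubectl get daemonset -A
kubectl get nodes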
For quick checks from your laptop:
# See everything at a glance
bash infra/scripts/k8s-monitor.sh
# Or individual commands:
kubectl top nodes # Node CPU/memory
kubectl top pods -n suna # Pod CPU/memory
kubectl get pods -n suna # Pod status
kubectl get hpa -n suna # Autoscaler status
kubectl get events -n suna --sort-by='.lastTimestamp' # Recent events
# What image is each pod running?
kubectl get pods -n suna -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Is the latest deploy live?
kubectl get deployment suna-api -n suna -o jsonpath='{.spec.template.spec.containers[0].image}'
# How many pods are running?
kubectl get deployment suna-api -n suna
# What's the HPA doing?
kubectl get hpa -n suna
# Graceful rolling restart (zero downtime)
kubectl rollout restart deployment/suna-api -n suna
# Watch the restart progress
kubectl rollout status deployment/suna-api -n suna
# Logs from a specific pod
kubectl logs <pod-name> -n suna
# Logs from all suna-api pods
kubectl logs -l app.kubernetes.io/name=suna-api -n suna --tail=100
# Follow logs in real-time
kubectl logs -l app.kubernetes.io/name=suna-api -n suna -f
# Logs from a crashed pod (previous container)
kubectl logs <pod-name> -n suna --previous
# Scale pods (temporarily — HPA may override this)
kubectl scale deployment/suna-api -n suna --replicas=6
# To permanently change, update the HPA min:
kubectl patch hpa suna-api -n suna -p '{"spec":{"minReplicas":6}}'
# Scale nodes (via AWS)
aws eks update-nodegroup-config \
--cluster-name suna-eks \
--nodegroup-name suna-api-nodes \
--scaling-config minSize=3,maxSize=8,desiredSize=4
# See deployment history
kubectl rollout history deployment/suna-api -n suna
# Roll back to previous version
kubectl rollout undo deployment/suna-api -n suna
# Roll back to a specific revision
kubectl rollout undo deployment/suna-api -n suna --to-revision=3
# See why a pod isn't starting
kubectl describe pod <pod-name> -n suna
# Get a shell inside a running pod
kubectl exec -it <pod-name> -n suna -- /bin/bash
# See resource usage
kubectl top pod <pod-name> -n suna
# Node overview
kubectl get nodes -o wide
# Detailed node info (capacity, conditions, pods running on it)
kubectl describe node <node-name>
# What's running on each node
kubectl get pods -n suna -o wide
Environment variables (API keys, database URLs, etc.) flow like this:
AWS Secrets Manager (suna-env-prod)
→ CI/CD syncs to K8s secret (on every deploy)
→ K8s secret: suna-env in namespace suna
→ Mounted as env vars in every pod
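To check what the synced secret currently contains (keys only, values stay hidden):
# Lists each key and its size in bytes, without printing the values
kubectl describe secret suna-env -n suna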
Option A: Through a deploy (automatic)
Every time CI/CD deploys to EKS, it syncs secrets from Secrets Manager. So update the value in Secrets Manager (suna-env-prod), push to PRODUCTION, and the new pods come up with the updated values.
Option B: Manual sync (no code change needed)
Go to GitHub Actions → "Sync Secrets to EKS" → Run workflow:
Type sync in the confirm field to run it.
Or from the command line:
# Fetch from Secrets Manager and apply to K8s
SECRET_JSON=$(aws secretsmanager get-secret-value \
  --secret-id suna-env-prod \
  --query SecretString --output text)
kubectl create secret generic suna-env -n suna \
  --from-env-file=<(echo "$SECRET_JSON" | python3 -c "
import json, sys
data = json.load(sys.stdin)
for k, v in data.items():
    print(f'{k}={v}')
") --dry-run=client -o yaml | kubectl apply -f -
# Restart pods to pick up new secrets
kubectl rollout restart deployment/suna-api -n suna
Pods need to be restarted to pick up secret changes — K8s doesn't hot-reload env vars.
The nodes are full. Either the Cluster Autoscaler is already adding a node (give it a few minutes), or something is preventing it from scaling; check its logs as shown below.
# Check what's waiting
kubectl get pods -n suna --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n suna # Look at "Events" section
# Check node capacity
kubectl top nodes
# The Cluster Autoscaler should handle this automatically.
# Check its logs if nodes aren't being added:
kubectl logs -l app.kubernetes.io/name=cluster-autoscaler -n kube-system --tail=50
The app is crashing on startup repeatedly.
# See why
kubectl logs <pod-name> -n suna --previous
kubectl describe pod <pod-name> -n suna
# Common causes:
# - Missing env vars (secret not synced)
# - Database connection failed
# - OOMKilled (check memory limits)
# - Bad image (roll back)
Cluster Autoscaler only adds nodes when pods are Pending (can't be scheduled). If pods are running but using lots of memory, that's fine from K8s's perspective — the pods are scheduled and running.
If you want to add headroom:
# Increase the number of nodes
aws eks update-nodegroup-config \
--cluster-name suna-eks \
--nodegroup-name suna-api-nodes \
--scaling-config minSize=3,maxSize=8,desiredSize=4
# Roll back immediately
kubectl rollout undo deployment/suna-api -n suna
# Check what went wrong
kubectl logs -l app.kubernetes.io/name=suna-api -n suna --tail=200
kubectl get events -n suna --sort-by='.lastTimestamp'
You generally shouldn't need to, but if you do:
# Find the EC2 instance ID
kubectl get nodes -o wide # Note the INTERNAL-IP
# Use SSM Session Manager (no SSH key needed)
aws ssm start-session --target <instance-id>
Quick checklist:
1. kubectl get pods -n suna — Are pods Running? If not, it's a K8s/infra issue.
2. kubectl top pods -n suna — Is CPU/memory maxed? If yes, scale up or fix the leak.
3. kubectl logs <pod> -n suna — Are there errors? If yes, it's your code.
4. kubectl get events -n suna — Any K8s-level events? (OOMKilled, FailedScheduling, etc.)
| I want to... | Command |
|---|---|
| See all pods | kubectl get pods -n suna |
| See pod logs | kubectl logs <pod> -n suna |
| See resource usage | kubectl top pods -n suna |
| Restart all pods | kubectl rollout restart deployment/suna-api -n suna |
| Roll back | kubectl rollout undo deployment/suna-api -n suna |
| Scale pods | kubectl scale deployment/suna-api -n suna --replicas=6 |
| Check HPA | kubectl get hpa -n suna |
| Check nodes | kubectl get nodes -o wide |
| See events | kubectl get events -n suna --sort-by='.lastTimestamp' |
| Shell into pod | kubectl exec -it <pod> -n suna -- /bin/bash |
| Check current image | kubectl get deploy suna-api -n suna -o jsonpath='{.spec.template.spec.containers[0].image}' |
| Full monitoring dashboard | bash infra/scripts/k8s-monitor.sh |