Kortix EKS Operations Guide

infra/docs/eks-operations-guide.md

Everything you need to know about how our Kubernetes setup works, how to monitor and scale it, and what happens when things go wrong.


Table of Contents

  1. The Big Picture
  2. How Our K8s Setup Works
  3. What Runs Where
  4. How Deployments Work
  5. How Scaling Works
  6. Health Checks (How K8s Knows If Your App Is Alive)
  7. What Happens When Things Break
  8. Monitoring
  9. Common Operations
  10. Secrets Management
  11. Troubleshooting

The Big Picture

Production runs on three targets simultaneously:

User request
  → Cloudflare DNS (api-eks.kortix.com)
  → AWS Application Load Balancer (ALB)
  → EKS cluster (suna-eks)          ← this is the primary
  → Your app pods

We also have:
  → Lightsail instance               ← legacy, still running
  → ECS cluster (suna-ecs)           ← legacy, still running

The EKS cluster is the main production target. Lightsail and ECS are still deployed to, but EKS handles the real traffic through api-eks.kortix.com.

Region: us-west-2 (Oregon)


How Our K8s Setup Works

If you're new to Kubernetes, here's the mental model. Think of it like a restaurant:

  • Cluster = the restaurant building (EKS manages this for us)
  • Nodes = the kitchen stations (EC2 machines, currently c7i.2xlarge — 8 vCPUs, 16 GB RAM each)
  • Pods = the chefs (each pod runs one copy of our backend app)
  • Deployment = the recipe card that says "always have 4 chefs working"
  • Service = the front desk that routes customers (requests) to available chefs
  • Ingress = the restaurant's street address (connects the ALB to our service)
  • HPA = the manager who hires more chefs when it gets busy

The actual resources in our cluster

Cluster: suna-eks (EKS v1.31)
├── Node Group: suna-api-nodes
│   ├── Instance type: c7i.2xlarge (8 vCPU, 16 GB RAM)
│   ├── Min nodes: 2
│   ├── Max nodes: 8
│   └── Desired: 3
│
├── Namespace: suna
│   ├── Deployment: suna-api (our backend app)
│   │   ├── Base replicas: 4 pods
│   │   ├── Each pod: 500m-1500m CPU, 2Gi-3Gi memory
│   │   └── Each pod runs 2 Gunicorn workers
│   │
│   ├── Service: suna-api (ClusterIP, routes traffic to pods)
│   ├── HPA: suna-api (autoscales 4-15 pods based on CPU and memory)
│   ├── PDB: suna-api (keeps at least 50% of pods alive during disruptions)
│   ├── Ingress: suna-api (ALB → internet-facing)
│   └── Secret: suna-env (all the env vars from Secrets Manager)
│
├── Namespace: kube-system
│   ├── AWS Load Balancer Controller (manages the ALB)
│   ├── Cluster Autoscaler (adds/removes nodes)
│   ├── CloudWatch Observability addon (sends metrics to CloudWatch)
│   └── Better Stack Collector (sends logs/metrics to Better Stack)
│
└── Namespace: default
    └── Better Stack Collector DaemonSet

What each pod gets

Every pod gets these environment variables and resources:

| Setting | Value |
| --- | --- |
| CPU request | 500m (half a core), guaranteed minimum |
| CPU limit | 1500m (1.5 cores), hard ceiling |
| Memory request | 2Gi, guaranteed minimum |
| Memory limit | 3Gi; if it uses more, K8s kills it (OOMKill) |
| Workers | 2 Gunicorn workers per pod |
| Port | 8000 |
| Health check | /v1/health-docker |

What "request" vs "limit" means:

  • Request = "I need at least this much." K8s won't schedule the pod on a node unless the node has this much free.
  • Limit = "Don't ever use more than this." If the pod tries to use more memory than the limit, K8s kills it. If it tries to use more CPU, K8s throttles it (slows it down but doesn't kill it).
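
Because scheduling is decided by requests, you can estimate how many pods fit on one node with simple division. The sketch below is only an approximation: it treats the instance's "16 GB" as 16 GiB, and the "allocatable" figures in the second call are made-up placeholders for what is left after system reservations and DaemonSets (check `kubectl describe node` for real numbers).

```python
def pods_that_fit(node_cpu_m: int, node_mem_mi: int,
                  req_cpu_m: int = 500, req_mem_mi: int = 2048) -> int:
    """How many pods the scheduler can place on one node, judged by requests only."""
    by_cpu = node_cpu_m // req_cpu_m      # limited by the CPU request (500m)
    by_mem = node_mem_mi // req_mem_mi    # limited by the memory request (2Gi)
    return min(by_cpu, by_mem)

# Raw c7i.2xlarge capacity: 8 vCPU = 8000m, 16 GiB = 16384 Mi
print(pods_that_fit(8000, 16384))   # 8 (memory is the binding constraint)

# Hypothetical allocatable after system reservations (placeholder numbers)
print(pods_that_fit(7900, 15000))   # 7
```

Memory requests, not CPU, decide how densely our pods pack onto a node.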

What Runs Where

Infrastructure-as-Code (Pulumi)

All the K8s infrastructure is defined in code using Pulumi (TypeScript). You'll find it in infra/:

infra/
├── environments/
│   ├── prod/index.ts        ← production stack (EKS + monitoring)
│   ├── dev/index.ts          ← dev (Lightsail only)
│   └── staging/index.ts      ← staging (Lightsail only)
│
├── modules/
│   ├── kubernetes/
│   │   ├── cluster.ts        ← EKS cluster + node groups
│   │   ├── workload.ts       ← Deployment, Service, HPA, PDB, Ingress
│   │   ├── autoscaler.ts     ← Cluster Autoscaler (Helm chart)
│   │   ├── iam.ts            ← IAM roles for nodes, ALB controller, autoscaler
│   │   └── types.ts          ← TypeScript interfaces
│   │
│   ├── monitoring/
│   │   ├── alarms.ts         ← CloudWatch alarms + dashboard
│   │   └── types.ts
│   │
│   └── lightsail/
│       ├── instance.ts       ← Lightsail for dev/staging
│       └── types.ts

If you want to change the number of pods, node types, memory limits, etc., edit the config values in environments/prod/index.ts and run pulumi up.

CI/CD Pipeline

Defined in .github/workflows/docker-build.yml. When you push to the PRODUCTION branch:

Push to PRODUCTION
  → Build Docker image
  → Push to GHCR with tags `:prod` and `:<commit-sha>`
  → Deploy to Lightsail (SSH + docker compose)    } These three
  → Deploy to ECS (aws ecs update-service)         } run in
  → Deploy to EKS (kubectl set image)              } parallel

The EKS deploy step:

  1. Authenticates to AWS using OIDC (no stored credentials)
  2. Gets kubeconfig for the cluster
  3. Syncs secrets from AWS Secrets Manager → K8s secret
  4. Updates the deployment image to the new SHA tag
  5. Waits for the rollout to complete (up to 5 minutes)
  6. Prints the running pods and current image for verification
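
To make steps 2, 4, 5 and 6 concrete, here is a rough sketch of the commands such a deploy job might run. It is not copied from docker-build.yml: the image repository and container name are hypothetical parameters, and the OIDC login and secret sync steps are left out. The script only builds and prints the commands; it does not execute them.

```python
import shlex

def eks_deploy_commands(sha: str, image_repo: str, container: str) -> list[list[str]]:
    """Build the kubectl/aws commands for one deploy (nothing is executed here)."""
    image = f"{image_repo}:{sha}"
    return [
        # Step 2: fetch kubeconfig for the cluster
        ["aws", "eks", "update-kubeconfig", "--name", "suna-eks", "--region", "us-west-2"],
        # Step 4: point the deployment at the new SHA tag
        ["kubectl", "set", "image", "deployment/suna-api", f"{container}={image}", "-n", "suna"],
        # Step 5: wait up to 5 minutes for the rollout
        ["kubectl", "rollout", "status", "deployment/suna-api", "-n", "suna", "--timeout=5m"],
        # Step 6: show what is running
        ["kubectl", "get", "pods", "-n", "suna", "-o", "wide"],
    ]

# "ghcr.io/example/suna-api" and "api" are placeholders, not the real names.
for cmd in eks_deploy_commands("abc1234", "ghcr.io/example/suna-api", "api"):
    print(shlex.join(cmd))
```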

How Deployments Work

Rolling Updates

When a new image is deployed, K8s doesn't kill all pods and start new ones. It does a rolling update:

Step 1: Start 1 new pod with the new image
Step 2: Wait until the new pod passes health checks
Step 3: Remove 1 old pod
Step 4: Repeat until all pods are running the new image

Our config:

  • maxUnavailable: 0 — never kill an old pod before a new one is ready (zero downtime)
  • maxSurge: 1 — only create 1 extra pod at a time (don't overwhelm the nodes)

So if you have 4 pods, during deployment you'll briefly have 5 (4 old + 1 new), then 4 (3 old + 1 new), etc.
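
The pod counts during a rollout can be traced with a few lines of Python. This is a simplified model that assumes maxUnavailable: 0 and that every new pod passes its health checks; it is a teaching aid, not how the Deployment controller is implemented.

```python
def rolling_update(replicas: int, max_surge: int = 1) -> list[tuple[int, int]]:
    """Return (old_pods, new_pods) after each step, assuming maxUnavailable is 0."""
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0:
        surge = min(max_surge, old)
        new += surge                 # start new pods and wait until they are ready
        steps.append((old, new))
        old -= surge                 # only then remove the same number of old pods
        steps.append((old, new))
    return steps

for old, new in rolling_update(4):
    print(f"{old} old + {new} new = {old + new} pods")
```

After the first step the total alternates between 5 and 4 and never drops below 4, which is the zero-downtime property described above.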

Graceful Shutdown

When a pod is being removed:

  1. K8s sends it a SIGTERM signal ("please shut down")
  2. The pod gets 120 seconds (terminationGracePeriodSeconds) to finish in-flight requests
  3. After 120s, K8s sends SIGKILL ("you're done, goodbye")

This means long-running agent executions get up to 2 minutes to complete during deployments.
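
One thing to check: the 120 seconds only help if Gunicorn itself waits that long. Gunicorn's graceful_timeout defaults to 30 seconds, after which it force-kills workers that are still busy. The config sketch below shows the idea; the file name, the bind address and the value 110 are assumptions for illustration, and the real app may already set these elsewhere.

```python
# gunicorn.conf.py (hypothetical file; the real app may configure Gunicorn differently)

TERMINATION_GRACE_PERIOD_SECONDS = 120  # from the pod spec described above

bind = "0.0.0.0:8000"   # port 8000, as in the pod settings (interface is an assumption)
workers = 2             # 2 Gunicorn workers per pod

# On SIGTERM, Gunicorn lets workers finish in-flight requests for up to
# graceful_timeout seconds. Keep it below the K8s grace period so Gunicorn
# exits on its own before SIGKILL arrives. The value 110 is an assumption.
graceful_timeout = 110

assert graceful_timeout < TERMINATION_GRACE_PERIOD_SECONDS
```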

Pod Distribution

We use topologySpreadConstraints to spread pods across nodes evenly. If you have 4 pods and 2 nodes, each node gets 2 pods. This way, if one node dies, you don't lose all your pods.
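
A quick way to reason about the worst case is to compute an even spread and look at the fullest node. This sketch assumes the constraint is enforced strictly with a skew of at most 1; the actual maxSkew and whenUnsatisfiable settings live in workload.ts and may be looser.

```python
def spread_evenly(pods: int, nodes: int) -> list[int]:
    """Pods per node when pods are spread as evenly as possible."""
    base, extra = divmod(pods, nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]

print(spread_evenly(4, 2))        # [2, 2]
print(spread_evenly(4, 3))        # [2, 1, 1]
print(max(spread_evenly(4, 3)))   # 2 -> at most 2 pods lost if one node dies
```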


How Scaling Works

There are two levels of autoscaling, and they work together:

Level 1: Pod Autoscaling (HPA)

The Horizontal Pod Autoscaler watches CPU and memory usage across your pods and adds/removes pods to keep utilization around the target.

| Setting | Value |
| --- | --- |
| Min pods | 4 |
| Max pods | 15 |
| CPU target | 70% average utilization |
| Memory target | 80% average utilization |
| Scale up speed | Can double pods every 60 seconds |
| Scale down speed | Removes at most 25% of pods every 60 seconds |
| Scale down cooldown | Waits 5 minutes before scaling down (prevents flapping) |
| Scale up cooldown | Only waits 30 seconds before scaling up (fast reaction) |

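HPA will scale up if either CPU or memory exceeds its target. Utilization is measured against each pod's request (500m CPU, 2Gi memory), not its limit, and by default HPA ignores deviations within about 10% of the target. So if CPU is at 40% but memory hits 90%, HPA will still add pods.

Example: If your 4 pods are averaging 85% CPU (or 90% memory), HPA will add more pods. If both drop below target, HPA will (after 5 minutes) remove pods back to the minimum of 4.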
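
For a feel of the numbers, the sketch below applies the standard HPA formula, desired = ceil(current × currentUtilization / target), to each metric and takes the larger result. It uses the settings from the table above and the default 10% tolerance, and it ignores the stabilization windows and rate limits, so treat it as an estimate rather than a replica of the controller.

```python
import math

def desired_for_metric(current_replicas: int, current_pct: float,
                       target_pct: float, tolerance: float = 0.1) -> int:
    """Replica count one metric asks for (utilization is % of the pod request)."""
    ratio = current_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:     # close enough to target: no change
        return current_replicas
    return math.ceil(current_replicas * ratio)

def hpa_desired(current_replicas: int, cpu_pct: float, mem_pct: float,
                cpu_target: float = 70, mem_target: float = 80,
                min_pods: int = 4, max_pods: int = 15) -> int:
    """Take the largest per-metric answer, then clamp to the min/max pod counts."""
    want = max(desired_for_metric(current_replicas, cpu_pct, cpu_target),
               desired_for_metric(current_replicas, mem_pct, mem_target))
    return max(min_pods, min(max_pods, want))

print(hpa_desired(4, cpu_pct=85, mem_pct=50))   # 5
print(hpa_desired(4, cpu_pct=40, mem_pct=90))   # 5
print(hpa_desired(4, cpu_pct=40, mem_pct=85))   # 4 (85 vs 80 is within tolerance)
print(hpa_desired(8, cpu_pct=30, mem_pct=40))   # 4
```

Note the third case: a small overshoot such as 85% against an 80% target falls inside the default tolerance and does not trigger a scale-up.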

Level 2: Node Autoscaling (Cluster Autoscaler)

If HPA wants to add pods but there's no room on existing nodes, the Cluster Autoscaler adds new EC2 nodes.

| Setting | Value |
| --- | --- |
| Min nodes | 2 |
| Max nodes | 8 |
| Scale down threshold | Node is underutilized if < 65% used |
| Scale down delay | Waits 5 minutes after adding a node before considering removal |
| Scale down wait | Node must be underutilized for 5 minutes straight |
| Strategy | "least-waste": picks the node size that wastes the least resources |

How they work together:

Traffic spikes
  → Pods CPU goes above 70%
  → HPA: "I need more pods!"
  → HPA creates new pods
  → If nodes are full, new pods are "Pending" (can't be scheduled)
  → Cluster Autoscaler sees Pending pods
  → Cluster Autoscaler: "I'll add a node!"
  → New EC2 node joins the cluster (~3-5 minutes)
  → Pending pods get scheduled on the new node

Traffic drops
  → Pods CPU drops below target
  → HPA: "Too many pods, removing some" (after 5 min cooldown)
  → Pods get removed
  → Some nodes become underutilized (< 65% used)
  → Cluster Autoscaler: "This node is mostly empty, removing it" (after 5 min)
  → Pods on that node get moved to other nodes first
  → Empty node is terminated
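
It helps to know that the Cluster Autoscaler judges "underutilized" by the sum of pod requests on a node, not by live CPU or memory usage. The sketch below models that check with our 65% threshold; the allocatable figures are placeholder numbers, and DaemonSet pods and the 5 minute timers are ignored.

```python
def node_utilization(requested_cpu_m: int, requested_mem_mi: int,
                     alloc_cpu_m: int, alloc_mem_mi: int) -> float:
    """Requested share of the node, taking the larger of the CPU and memory ratios."""
    return max(requested_cpu_m / alloc_cpu_m, requested_mem_mi / alloc_mem_mi)

def is_scale_down_candidate(utilization: float, threshold: float = 0.65) -> bool:
    return utilization < threshold

# Hypothetical allocatable for one node (placeholder numbers)
ALLOC_CPU_M, ALLOC_MEM_MI = 7900, 15000

# Two suna-api pods (500m / 2Gi requests each) on the node
two_pods = node_utilization(2 * 500, 2 * 2048, ALLOC_CPU_M, ALLOC_MEM_MI)
print(round(two_pods, 2), is_scale_down_candidate(two_pods))     # 0.27 True

# Five suna-api pods on the node
five_pods = node_utilization(5 * 500, 5 * 2048, ALLOC_CPU_M, ALLOC_MEM_MI)
print(round(five_pods, 2), is_scale_down_candidate(five_pods))   # 0.68 False
```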

PodDisruptionBudget (PDB)

During voluntary disruptions (node drain, cluster upgrade, autoscaler removing a node), the PDB guarantees at least 50% of pods stay running. So if you have 4 pods, at least 2 must be alive during any maintenance operation.
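
The number of pods that may be evicted at once follows directly from that rule. The sketch assumes the PDB is written as minAvailable: 50% (Kubernetes rounds a percentage up to the next whole pod); if workload.ts uses maxUnavailable instead, the rounding differs slightly.

```python
import math

def allowed_disruptions(healthy_pods: int, desired_pods: int,
                        min_available_pct: int = 50) -> int:
    """How many pods a drain may evict right now under a minAvailable percentage."""
    min_available = math.ceil(desired_pods * min_available_pct / 100)
    return max(0, healthy_pods - min_available)

print(allowed_disruptions(4, 4))   # 2
print(allowed_disruptions(5, 5))   # 2 (50% of 5 rounds up to 3)
print(allowed_disruptions(2, 4))   # 0 (already at the floor, drains must wait)
```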

Memory Safety Net

Instead of scheduled pod restarts (which would kill in-progress agent runs), we rely on K8s native mechanisms:

  1. Memory-based HPA — When pods average > 80% memory, HPA adds more pods to spread the load
  2. OOMKill + auto-restart — If a pod exceeds its 3Gi memory limit, K8s kills and restarts it automatically. The other pods keep serving traffic with zero downtime.

Health Checks

K8s constantly checks if your app is healthy using three probes. All three hit the same endpoint (/v1/health-docker on port 8000) but serve different purposes:

Startup Probe — "Has the app finished booting?"

| Setting | Value |
| --- | --- |
| Initial delay | 10 seconds |
| Check interval | Every 10 seconds |
| Max failures | 12 |
| Timeout per check | 5 seconds |

This gives the app up to 130 seconds (10s delay + 12 x 10s) to start. During this time, liveness and readiness probes are paused. This is important because our app takes a while to load models, connect to databases, etc.

Readiness Probe — "Can this pod handle traffic right now?"

| Setting | Value |
| --- | --- |
| Initial delay | 15 seconds |
| Check interval | Every 10 seconds |
| Max failures | 3 |
| Timeout per check | 5 seconds |

If this fails 3 times in a row, K8s stops sending traffic to this pod (removes it from the load balancer). The pod stays alive — it just doesn't get requests. Once it starts passing again, traffic resumes.

Think of it as: "This pod is alive but busy/unhealthy, give it a break."

Liveness Probe — "Is this pod stuck/dead?"

| Setting | Value |
| --- | --- |
| Initial delay | 30 seconds |
| Check interval | Every 30 seconds |
| Max failures | 3 |
| Timeout per check | 5 seconds |

If this fails 3 times in a row, K8s kills and restarts the pod. This catches deadlocks, stuck event loops, memory corruption — cases where the process is technically running but completely broken.

Think of it as: "This pod is dead inside, restart it."
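
The timing of all three probes comes from the same arithmetic: failures allowed × check interval, plus the initial delay for the startup budget. The sketch below plugs in the values from the tables above. It is an approximation that ignores the 5 second timeout on each check.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    initial_delay_s: int
    period_s: int
    failure_threshold: int

    def failure_window_s(self) -> int:
        """Seconds of consecutive failed checks before K8s acts."""
        return self.failure_threshold * self.period_s

startup = Probe(initial_delay_s=10, period_s=10, failure_threshold=12)
readiness = Probe(initial_delay_s=15, period_s=10, failure_threshold=3)
liveness = Probe(initial_delay_s=30, period_s=30, failure_threshold=3)

print(startup.initial_delay_s + startup.failure_window_s())  # 130 s to finish booting
print(readiness.failure_window_s())                          # 30 s until traffic stops
print(liveness.failure_window_s())                           # 90 s until a restart
```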


What Happens When Things Break

Pod crashes (OOMKill, unhandled exception, etc.)

Pod crashes
  → K8s immediately starts a new one (restartPolicy: Always)
  → Other pods keep serving traffic (no downtime)
  → If it crashes repeatedly, K8s uses "backoff" — waits longer between restarts
    (10s, 20s, 40s, 80s, ... up to 5 minutes)
  → This is called CrashLoopBackOff in pod status

You lose nothing. K8s handles this automatically. The other pods keep running.
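
The backoff sequence is a doubling delay with a cap. A small sketch, using the 10 second start and 5 minute cap mentioned above:

```python
from itertools import islice

def crashloop_delays(base_s: int = 10, cap_s: int = 300):
    """Yield the wait before each restart: doubles every time, capped at cap_s."""
    delay = base_s
    while True:
        yield delay
        delay = min(delay * 2, cap_s)

print(list(islice(crashloop_delays(), 7)))   # [10, 20, 40, 80, 160, 300, 300]
```

So a pod that keeps crashing settles into one restart attempt every 5 minutes after roughly the sixth failure.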

Pod uses too much memory (OOMKill)

Pod memory exceeds 3Gi limit
  → Linux kernel kills the process (OOMKilled)
  → K8s sees the container exited with code 137
  → K8s restarts the container in place (same pod, same node)
  → If it keeps OOMing, it goes into CrashLoopBackOff

If you see this happening a lot, you need to either fix the memory leak or increase the memory limit.
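
Exit code 137 is not special to Kubernetes: by shell convention, codes above 128 mean "killed by signal (code − 128)", and 137 − 128 = 9 is SIGKILL. The helper below decodes any exit code that way. Keep in mind that 137 alone does not prove an OOM kill; confirm with the OOMKilled reason in kubectl describe pod.

```python
import signal

def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable cause."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:          # not a known signal number on this platform
            return f"exited with status {code}"
    return f"exited with status {code}"

print(describe_exit_code(137))   # killed by SIGKILL
print(describe_exit_code(143))   # killed by SIGTERM
print(describe_exit_code(1))     # exited with status 1
```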

Node dies (hardware failure, AWS issue, etc.)

Node goes down
  → K8s notices within ~40 seconds (node heartbeat stops)
  → K8s marks all pods on that node as "Unknown"
  → After ~5 minutes, K8s evicts those pods
  → K8s schedules replacement pods on the remaining healthy nodes
  → If remaining nodes don't have enough room, Cluster Autoscaler adds a new node
  → The PDB plays no part here (it only covers voluntary disruptions); topologySpreadConstraints are what kept pods on OTHER nodes

Because we spread pods across nodes (topologySpreadConstraints), losing one node typically means losing only 1-2 pods out of 4+. The other pods keep serving traffic.

App returns errors but process is running (deadlock, stuck, etc.)

App stops responding to /v1/health-docker
  → Readiness probe fails (3 times × 10s = 30 seconds)
  → K8s removes pod from load balancer (no new traffic)
  → Liveness probe fails (3 times × 30s = 90 seconds)
  → K8s kills and restarts the pod
  → Pod comes back up, passes probes, rejoins load balancer

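Both probes run in parallel, so the liveness window (about 90 seconds, plus up to 5 seconds for each timed-out check) sets the pace. Total time from "app stuck" to "pod restarted" is roughly 2 minutes.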

Deployment goes bad (new image is broken)

New image deployed via rolling update
  → K8s starts 1 new pod with new image
  → New pod fails startup probe (crashes, health check fails, etc.)
  → K8s does NOT remove any old pods (maxUnavailable: 0)
  → Rollout is stuck — old pods keep serving traffic
  → After 5 minutes, the CI/CD timeout reports failure

The old pods keep running. Your users don't notice anything. To fix it:

bash
# See what's wrong
kubectl describe pod -n suna -l app.kubernetes.io/name=suna-api

# Roll back to the previous version
kubectl rollout undo deployment/suna-api -n suna

# Verify
kubectl rollout status deployment/suna-api -n suna

All pods on all nodes go down (catastrophic failure)

This is extremely unlikely (would require all nodes in multiple availability zones to fail simultaneously), but if it happens:

All pods down
  → ALB health checks fail
  → ALB returns 503 to all requests
  → CloudWatch alarm fires (pod count < 1)
  → Email alert sent via SNS
  → The node group's Auto Scaling group replaces the failed nodes (the Cluster Autoscaler runs on those nodes, so it is down too)
  → K8s reschedules pods on new nodes
  → ALB health checks pass, traffic resumes

Manual recovery if needed:

bash
# Check cluster status
kubectl get nodes
kubectl get pods -n suna

# Force restart all pods
kubectl rollout restart deployment/suna-api -n suna

# If nodes are gone, check AWS console for the managed node group
# Or scale it manually:
aws eks update-nodegroup-config \
  --cluster-name suna-eks \
  --nodegroup-name suna-api-nodes \
  --scaling-config minSize=2,maxSize=8,desiredSize=3

Monitoring

We have three monitoring layers:

1. CloudWatch (AWS native)

Dashboard: Go to AWS Console → CloudWatch → Dashboards → suna-api-prod

Shows:

  • Node CPU utilization
  • Node memory utilization
  • Running pod count
  • Node count
  • Pod-level CPU and memory
  • Network I/O

Alarms (sends email alerts):

| Alarm | Threshold | Severity |
| --- | --- | --- |
| CPU Warning | > 70% for 10 minutes | Warning |
| CPU Critical | > 85% for 2 minutes | Critical |
| Memory Warning | > 75% for 10 minutes | Warning |
| Memory Critical | > 90% for 2 minutes | Critical |
| Pod Count Low | < 1 running pod for 2 minutes | Critical |
| Node Count Low | < 2 nodes for 2 minutes | Critical |

2. Better Stack (cloud observability)

Better Stack collects logs and metrics from all containers in the cluster using an eBPF-based collector (DaemonSet — runs on every node).

  • Logs: Telemetry → Logs (or Live tail)
  • Dashboards: Telemetry → Dashboards
  • Uptime: Set up HTTP monitors for api-eks.kortix.com

3. Terminal monitoring

For quick checks from your laptop:

bash
# See everything at a glance
bash infra/scripts/k8s-monitor.sh

# Or individual commands:
kubectl top nodes                          # Node CPU/memory
kubectl top pods -n suna                   # Pod CPU/memory
kubectl get pods -n suna                   # Pod status
kubectl get hpa -n suna                    # Autoscaler status
kubectl get events -n suna --sort-by='.lastTimestamp'  # Recent events

Common Operations

Check what's running

bash
# What image is each pod running?
kubectl get pods -n suna -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Is the latest deploy live?
kubectl get deployment suna-api -n suna -o jsonpath='{.spec.template.spec.containers[0].image}'

# How many pods are running?
kubectl get deployment suna-api -n suna

# What's the HPA doing?
kubectl get hpa -n suna

Restart pods (without redeploying)

bash
# Graceful rolling restart (zero downtime)
kubectl rollout restart deployment/suna-api -n suna

# Watch the restart progress
kubectl rollout status deployment/suna-api -n suna

View logs

bash
# Logs from a specific pod
kubectl logs <pod-name> -n suna

# Logs from all suna-api pods
kubectl logs -l app.kubernetes.io/name=suna-api -n suna --tail=100

# Follow logs in real-time
kubectl logs -l app.kubernetes.io/name=suna-api -n suna -f

# Logs from a crashed pod (previous container)
kubectl logs <pod-name> -n suna --previous

Scale manually

bash
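# Scale pods (temporary: the HPA may override this)
kubectl scale deployment/suna-api -n suna --replicas=6

# To hold a higher floor, raise the HPA min. Pulumi still owns this value,
# so a later `pulumi up` may revert it; for a permanent change, edit
# environments/prod/index.ts instead.
kubectl patch hpa suna-api -n suna -p '{"spec":{"minReplicas":6}}'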

# Scale nodes (via AWS)
aws eks update-nodegroup-config \
  --cluster-name suna-eks \
  --nodegroup-name suna-api-nodes \
  --scaling-config minSize=3,maxSize=8,desiredSize=4

Roll back a deployment

bash
# See deployment history
kubectl rollout history deployment/suna-api -n suna

# Roll back to previous version
kubectl rollout undo deployment/suna-api -n suna

# Roll back to a specific revision
kubectl rollout undo deployment/suna-api -n suna --to-revision=3

Debug a pod

bash
# See why a pod isn't starting
kubectl describe pod <pod-name> -n suna

# Get a shell inside a running pod
kubectl exec -it <pod-name> -n suna -- /bin/bash

# See resource usage
kubectl top pod <pod-name> -n suna

Check node health

bash
# Node overview
kubectl get nodes -o wide

# Detailed node info (capacity, conditions, pods running on it)
kubectl describe node <node-name>

# What's running on each node
kubectl get pods -n suna -o wide

Secrets Management

Environment variables (API keys, database URLs, etc.) flow like this:

AWS Secrets Manager (suna-env-prod)
  → CI/CD syncs to K8s secret (on every deploy)
  → K8s secret: suna-env in namespace suna
  → Mounted as env vars in every pod

Update a secret

Option A: Through a deploy (automatic)

Every time CI/CD deploys to EKS, it syncs secrets from Secrets Manager. So:

  1. Update the value in AWS Secrets Manager (suna-env-prod)
  2. Push to PRODUCTION branch (or trigger a deploy)
  3. The deploy job syncs the secret and deploys the new image

Option B: Manual sync (no code change needed)

Go to GitHub Actions → "Sync Secrets to EKS" → Run workflow:

  1. Type sync in the confirm field
  2. Choose whether to restart the deployment
  3. Click "Run workflow"

Or from the command line:

bash
# Fetch from Secrets Manager and apply to K8s
SECRET_JSON=$(aws secretsmanager get-secret-value \
  --secret-id suna-env-prod \
  --query SecretString --output text)

kubectl create secret generic suna-env -n suna \
  --from-env-file=<(echo "$SECRET_JSON" | python3 -c "
import json, sys
data = json.load(sys.stdin)
for k, v in data.items():
    print(f'{k}={v}')
") --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new secrets
kubectl rollout restart deployment/suna-api -n suna

Pods need to be restarted to pick up secret changes — K8s doesn't hot-reload env vars.
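
One caveat with the KEY=value approach above: a value containing a newline (a PEM key, for example) does not survive an env file. A possible alternative, sketched below, is to build the Secret manifest directly and base64-encode each value. The function name is made up for illustration, and it assumes every value in the Secrets Manager JSON is a string.

```python
import base64
import json

def build_secret_manifest(secret_json: str, name: str = "suna-env",
                          namespace: str = "suna") -> dict:
    """Turn a flat JSON object of env vars into a K8s Secret manifest."""
    values = json.loads(secret_json)
    data = {
        key: base64.b64encode(str(value).encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }

# Example input; the \\n becomes a real newline once the JSON is parsed.
example = '{"API_KEY": "abc", "PEM": "line1\\nline2"}'
print(json.dumps(build_secret_manifest(example), indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped into kubectl apply -f - in place of the create/dry-run step.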


Troubleshooting

"Pods are stuck in Pending"

The nodes are probably full. Check what's waiting, then confirm the Cluster Autoscaler is adding nodes:

bash
# Check what's waiting
kubectl get pods -n suna --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n suna  # Look at "Events" section

# Check node capacity
kubectl top nodes

# The Cluster Autoscaler should handle this automatically.
# Check its logs if nodes aren't being added:
kubectl logs -l app.kubernetes.io/name=cluster-autoscaler -n kube-system --tail=50

"Pods are in CrashLoopBackOff"

The app is crashing on startup repeatedly.

bash
# See why
kubectl logs <pod-name> -n suna --previous
kubectl describe pod <pod-name> -n suna

# Common causes:
# - Missing env vars (secret not synced)
# - Database connection failed
# - OOMKilled (check memory limits)
# - Bad image (roll back)

"High memory but Cluster Autoscaler isn't adding nodes"

Cluster Autoscaler only adds nodes when pods are Pending (can't be scheduled). If pods are running but using lots of memory, that's fine from K8s's perspective — the pods are scheduled and running.

If you want to add headroom:

bash
# Increase the number of nodes
aws eks update-nodegroup-config \
  --cluster-name suna-eks \
  --nodegroup-name suna-api-nodes \
  --scaling-config minSize=3,maxSize=8,desiredSize=4

"Deploy succeeded but the app is broken"

bash
# Roll back immediately
kubectl rollout undo deployment/suna-api -n suna

# Check what went wrong
kubectl logs -l app.kubernetes.io/name=suna-api -n suna --tail=200
kubectl get events -n suna --sort-by='.lastTimestamp'

"I need to SSH into a node"

You generally shouldn't need to, but if you do:

bash
# Find the EC2 instance ID (the last segment of the node's providerID)
kubectl get nodes  # Note the node NAME
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'

# Use SSM Session Manager (no SSH key needed)
aws ssm start-session --target <instance-id>

"How do I know if it's my code or K8s?"

Quick checklist:

  1. kubectl get pods -n suna — Are pods Running? If not, it's a K8s/infra issue.
  2. kubectl top pods -n suna — Is CPU/memory maxed? If yes, scale up or fix the leak.
  3. kubectl logs <pod> -n suna — Are there errors? If yes, it's your code.
  4. kubectl get events -n suna — Any K8s-level events? (OOMKilled, FailedScheduling, etc.)
  5. Check Better Stack logs for patterns.

Quick Reference Card

| I want to... | Command |
| --- | --- |
| See all pods | kubectl get pods -n suna |
| See pod logs | kubectl logs <pod> -n suna |
| See resource usage | kubectl top pods -n suna |
| Restart all pods | kubectl rollout restart deployment/suna-api -n suna |
| Roll back | kubectl rollout undo deployment/suna-api -n suna |
| Scale pods | kubectl scale deployment/suna-api -n suna --replicas=6 |
| Check HPA | kubectl get hpa -n suna |
| Check nodes | kubectl get nodes -o wide |
| See events | kubectl get events -n suna --sort-by='.lastTimestamp' |
| Shell into pod | kubectl exec -it <pod> -n suna -- /bin/bash |
| Check current image | kubectl get deploy suna-api -n suna -o jsonpath='{.spec.template.spec.containers[0].image}' |
| Full monitoring dashboard | bash infra/scripts/k8s-monitor.sh |