docs/internal/K8S_AUTHORITY_DESIGN.md
Scope: this is cluster- and cloud-agnostic — the transport is plain
kubectl exec, so it runs against EKS, GKE, AKS, k3d, minikube, or kind with no code changes. AWS/EKS/S3 specifics below are one reference provider, not a requirement.
Implementation status (foundational slice landed). The
RemoteTransportseam (services/remote/transport.rs) ships:KubectlExecTransport, the sharedkubectl_exec_argvbuilder,agent_bootstrap_pycode, the genericbootstrap_agent, andKubeConnection— the EKS analogue ofSshConnection, reusing the transport-agnosticAgentChannel. Plusbuild_kube_terminal_args,TerminalWrapper::kube, and theAuthority::kubeconstructor. SSH'sconnection.rsis untouched (SSH provably unchanged). The kubectl long-running (LSP) spawner (authority/kube_spawner.rs,KubectlLongRunningSpawner) ships too —sh -cwrapping for cwd/env sincekubectl exechas no-w/-e,command_existsviacommand -vwith the captured probe PATH, host-limit log-and-ignore — plusAuthority::kube_from_connectionthat assembles a full EKS authority (RemoteFileSystem + RemoteProcessSpawner over the channel + the kubectl LSP spawner). The per-session activation primitive (Editor::set_session_authority) is in. The agent heartbeat (spawn_heartbeat_task— a periodicinfoping, no agent.py change / no protocol bump) ships and is wired intoKubeConnectionso idlekubectl execstreams survive LB/NAT idle timeouts; it self-terminates via aWeakref. The asyncattachRemoteAgentplugin op is wired end to end: JSeditor.attachRemoteAgent(spec)→PluginCommand:: AttachRemoteAgent→ a runtime task runsconnect_kube_authority→AsyncMessage::RemoteAttachReady→install_authority_with_keepalive→ the restart loop (standalonemain.rsand daemonEditorServer) adopts the authority and its keepalive (the same slot SSH uses), so the live carrier + reconnect/heartbeat survive the rebuild. Thek8s-workspace.tsplugin ships the Provider model (attach-existing/manifest/run/command-Terraform escape hatch) with env-probe, RBAC-exec preflight, and connect/disconnect commands. The EKS reconnect task re-runskubectl execon stream drop. Unit + e2e tested; plugin type-checks. Remaining (core-team-gated): live multi-session so background cloud sessions stay warm (D4) — until then attach uses the destructive restart, which is correct, just not warm.
Status: design. Nothing else here ships yet. Supersedes the earlier
"EKS + S3 full cloud authority" draft, which proposed a bespoke
S3FileSystem. That half is deleted: per the refined requirement,
Fresh reaches workspace data only through the pod, so S3 never appears
in Fresh's code at all. It is demoted to a pod-provisioning detail (see
§"S3 is the pod's problem").
Read AUTHORITY_DESIGN.md first, then the SSH
remote design (ssh-remote-editing-design.md).
This document is small on purpose: the EKS authority is the SSH
authority with the transport swapped from ssh to kubectl exec.
The durable home for workspace bytes is an S3 bucket (a cheap storage tier). Fresh accesses those bytes only through a running pod that mounts the bucket. When the pod is down, the data still lives in S3 — Fresh simply can't open it until a pod is back. Bringing up a fresh pod against the same bucket restores access with no warm-up step.
Two consequences fall straight out:
S3FileSystem. Fresh talks to a pod, which
presents an ordinary POSIX view of the workspace.The remote-agent stack Fresh already ships for SSH is transport-agnostic end to end:
AgentChannel::from_transport<R, W> takes any
AsyncBufRead/AsyncWrite pair
(services/remote/channel.rs:119). It is not SSH-aware.ssh … python3 -u -c "exec(sys.stdin.read(N))", stream the agent
source into its stdin, wait for the ready line, then hand the
child's stdout/stdin to AgentChannel::new
(services/remote/connection.rs:117-186).RemoteFileSystem,
RemoteProcessSpawner, RemoteLongRunningSpawner, and the Python
agent itself — only ever talks to the AgentChannel. None of them
knows what carries the bytes.spawn_reconnect_task_with calls a
caller-supplied closure to produce a fresh (reader, writer) and
hot-swaps it via channel.replace_transport(...)
(connection.rs:252-323).So an EKS authority needs exactly one genuinely new thing: a transport
that bootstraps the agent over kubectl exec instead of ssh.
Everything else — file I/O, process spawn, LSP spawn, find-in-files,
save, auto-recovery, reconnect — is the SSH implementation, unchanged.
A RemoteTransport seam. Factor the "spawn the agent process and
give me (reader, writer) plus a respawn closure" step out of
connection.rs into a small trait with two impls:
/// Bootstraps the Python agent over some carrier and yields the
/// stdio pair the AgentChannel rides on. The respawn closure is what
/// the reconnect task calls to rebuild the carrier after a drop.
pub trait RemoteTransport: Send + Sync {
async fn connect(&self) -> Result<AgentStdio, TransportError>;
fn display(&self) -> String; // "user@host" / "eks:ctx/ns/pod"
}
pub struct SshTransport { params: ConnectionParams, /* … */ }
pub struct KubectlExecTransport {
context: Option<String>,
namespace: String,
pod: String,
container: Option<String>,
}
KubectlExecTransport::connect spawns
kubectl [--context CTX] exec -i -n NS [-c C] POD -- \
python3 -u -c "import sys;exec(sys.stdin.read(N))"
then performs the identical agent-source send + ready handshake
the SSH path already does. The bytes after handshake are the same
agent protocol over the same channel.
Authority::kube(...) — a near-clone of Authority::ssh(...). It
takes the already-built RemoteFileSystem / remote spawners (over the
kubectl-exec channel) and sets TerminalWrapper::kube(...). Like SSH,
path_translation: None — the editor operates directly in the pod's
path space (the mount looks like a normal directory in the pod;
there's nothing to translate).
TerminalWrapper::kube(target, workspace) — the only spawn that
does not ride the agent channel, exactly as SSH's terminal uses a
separate ssh -t PTY:
kubectl exec -it -n NS [-c C] POD -- sh -lc 'cd WS; exec "$SHELL" -l'
Pins cwd through its own args ⇒ manages_cwd = true, same rule as the
SSH and docker wrappers.
That's the entire Fresh-side surface. Process spawning, including LSP,
comes for free: RemoteProcessSpawner/RemoteLongRunningSpawner send
spawn RPCs to the agent, which launches them inside the pod. There is
no separate KubeExecSpawner and no docker_spawner-style argv builder —
the agent is already the in-pod executor.
SSH connects at startup (fresh user@host:path). EKS attaches
post-boot, driven by the pod-management plugin (see
K8S_WORKSPACE_PLUGIN_DESIGN.md). The
wrinkle: building the transport is async (spawn kubectl, bootstrap
the agent, await ready) and produces keepalive resources (the
child process, the Tokio runtime, the reconnect task). The synchronous
from_plugin_payload path can't express that — and shouldn't, because a
live stdio channel can't travel through a JSON payload.
So EKS attach reuses the SSH connect machinery, not the docker payload machinery:
A new plugin op editor.attachRemoteAgent(spec) where spec names a
transport ({ kind: "kubectl-exec", context, namespace, pod, container, workspace, displayLabel }). It is fire-and-forget with restart
semantics, exactly like setAuthority.
Core stashes the spec as a PendingAuthoritySpec and triggers the
same destructive restart install_authority uses. During rebuild
(the existing connect_remote / create_startup_authority seam, and
its EditorServer::rebuild_editor mirror), core runs
connect_remote_agent(transport):
async fn connect_remote_agent(t: Arc<dyn RemoteTransport>)
-> Result<(Arc<RemoteFileSystem>,
Arc<dyn ProcessSpawner>,
Arc<dyn LongRunningSpawner>,
RemoteKeepalive), ConnectError>;
SSH startup and EKS attach both call this; only the transport differs.
The resulting RemoteKeepalive (runtime + child + reconnect task)
rides in the session_keepalive slot so the daemon path keeps the
channel alive across the rebuild — the same slot SSH already uses for
SshConnection. (Under per-session authority — AUTHORITY_DESIGN.md
§"Evolution" — this slot becomes per-session, so each warm Cloud
Workspace owns its own keepalive rather than the process owning one.)
setAuthority (docker, local) and attachRemoteAgent (ssh-style remote
over a transport) are the two attach families. Keeping them separate is
honest: one swaps synchronously-constructible backends, the other
establishes a live connection core must own.
Fresh never sees S3. How the pod's workspace volume is provisioned is
owned entirely by the plugin / cluster manifest, not core. The earlier
draft of this doc proposed mounting the bucket live (Mountpoint for S3
CSI) and giving the agent an in-place direct_write save path. The
deep-research review (k8s-workspace-research-prompt.md findings)
killed that recommendation, and it's worth being explicit about why.
A code editor lives or dies on POSIX fidelity and small-file latency, and the benchmarks are damning for an S3 live mount:
rename on standard
buckets — it fails early with an I/O error rather than emulate
copy+delete. Fresh's save is temp-write-then-rename (save.rs,
temp_path_for), so every save would fail. It also blocks
directory renames, random mid-file writes, chmod/chown, and
sym/hard links. S3 Express One Zone + --allow-overwrite restores
file rename only, at higher cost and with no Local Zones — still no
dir-rename or random writes. A non-starter for a live workspace.git clone / npm install / venvs generate
exactly the synchronous small-file metadata storm EFS is worst at.So the live working volume is Amazon EBS GP3 (dynamic PV via the EBS
CSI driver): full POSIX, sub-ms metadata, ~3000 IOPS baseline. The
durable, cheap S3 tier is reached by syncing the EBS workspace to a
bucket — on graceful teardown (a preStop hook), periodically, and/or
debounced on save — and restoring it on a fresh pod's startup
(initContainer). This still satisfies the stated invariant ("the data
is in S3 when the pod is down") while keeping editing fast and saves
atomic.
The win for Fresh: zero core change. Against an EBS-backed POSIX
filesystem, the existing remote save path (temp + atomic rename) just
works — the direct_write flag the prior draft proposed is deleted
from this design. Storage policy lives entirely in the pod manifest /
plugin; Fresh is oblivious, exactly as intended.
See K8S_WORKSPACE_PLUGIN_DESIGN.md
§"Storage" for the full alternatives table and the recommended
EBS-live + S3-sync manifest pattern.
With the sync model, durability is "durable as of the last sync," not
"durable on keystroke." Tighten the window by syncing on save (debounced)
and on preStop, accepting more S3 PUT churn; or accept a coarser
periodic sync. Either way it is a pod-side sync policy, not a Fresh
concern. Pod scratch (build outputs, caches) is intentionally excluded
from the durable set and lost on pod death — that's the point.
The remote agent is Python over stdin. The workspace image must ship
python3 (same constraint SSH already imposes), or we kubectl cp a
static agent binary in before exec. v1: require python3, checked by the
plugin's preflight with a clear error.
The research surfaced three exec-layer realities the kubectl-exec
transport must handle. None require an S3FileSystem-style rethink, but
they shape the transport and reconnect logic:
exec stream that sees no traffic, with
no TCP FIN — the UI just freezes. The agent channel already has
reconnect (replace_transport), but it needs an application-level
heartbeat: a periodic no-op ping RPC (well under the timeout) to keep
the stream warm, plus prompt detection of a dropped channel to trigger
reconnect. This is a small addition to the agent protocol, shared by
SSH and EKS.kubectl exec <oldpod> — it must call back
to the plugin to re-resolve the current pod before reconnecting
(open question 3, now load-bearing). On gone, surface "workspace pod
ended → Rebuild," never a frozen screen.kubectl exec -it binary,
kubectl implements the TerminalSizeQueue (SIGWINCH) protocol and
the K8s ≥1.30 WebSocket negotiation for us — we don't hand-roll a
streaming API client. One inherited gotcha: K8s ≥1.30 routes exec
over WebSockets, and ≥1.35 requires the create verb on
pods/exec; the plugin's preflight must check the developer identity
holds it, or attach fails confusingly after a cluster upgrade.kubectl exec transport. New code is
one transport impl, one constructor, one terminal wrapper, one attach
op. Filesystem, spawners, agent, reconnect: reused verbatim.S3FileSystem. Fresh imports no AWS crate.direct_write.RemoteTransport refactor touches connection.rs, which SSH
depends on — must land behind tests proving SSH is byte-for-byte
unchanged before EKS rides on it.kubectl as a host dependency for v1 (vs. kube-rs WebSocket exec
later). Acceptable to start; RemoteTransport makes the swap local.
Note: using the kubectl binary is also how we inherit TTY-resize
and the SPDY→WebSocket transition for free.status()/up().exec streams survive ELB/NAT idle timeouts
(5-15 min) instead of silently freezing. Shared by SSH and EKS;
pick an interval well under the smallest common timeout (~60 s).