docs/craft/sandbox/sandbox-exec-sidecar.md
kubectl exec onto the signed sidecarKubernetesSandboxManager drives in-pod work through 14 k8s_stream
(connect_get_namespaced_pod_exec) call sites. Most are filesystem operations
implemented as interpolated shell scripts and parsed via stdout sentinels. This
has three recurring problems:
agent_instructions.replace("'", "'\\''") in setup_session_workspace and
_regenerate_session_config, and content.replace("'", ...) in
write_sandbox_file."WORKSPACE_FOUND", "ERROR_NOT_FOUND", "DELETED", "WRITE_OK",
"EXISTS". This is the exact bug class that caused the
"EXISTS" in "NOT_EXISTS" substring failure in snapshot restore.ApiClient workaround (_rest_api_client vs _stream_api_client,
lines ~422-436) to avoid the streaming monkeypatch leaking into REST calls.The codebase already has the better channel: a signed HTTP sidecar daemon
(sandbox_daemon/server.py, port 8731, Ed25519-verified) that shares the
workspace and managed volumes read-write and already serves /push,
/snapshot/create, /snapshot/restore. Snapshots and file pushes were already
migrated off exec onto it. This plan finishes that migration for the remaining
filesystem ops, hardens the small residual that must stay as exec, and adopts
the complementary split: sidecar owns all filesystem/compute; exec is
retained only for in-sandbox-container process control.
setup_session_workspace line ~1668 and restore_snapshot line
~2003) and kill (in cleanup_session_workspace line ~1737). The dev server
must run as a long-lived process in the sandbox container (it serves traffic
on that container's exposed port); the sidecar can't host or signal it because
share_process_namespace=False. Everything else moves to the sidecar.bun install can move to the sidecar. The sidecar runs the same image, so
it has bun, and node_modules lands on the shared workspace volume. Only
the dev-server process is container-bound — not the install. This shrinks
residual exec to just start/stop.session_workspace_exists (3),
list_session_workspaces (4), _regenerate_session_config (6),
list_directory (7), read_file (8), _ensure_agents_md_attachments_section
(10), delete_file (12), write_sandbox_file (13), get_upload_stats (14),
and the filesystem halves of setup_session_workspace (1) and
cleanup_session_workspace (2).upload_file (11) is the only stdin-streaming exec; it tar-extracts into
attachments/ — i.e. exactly what /push already does. Reuse/extend the
existing /push endpoint rather than adding a new one.generate_pptx_preview (9) runs python preview.py (soffice/pdftoppm).
Tools are in the shared image so it can run in the sidecar, but it is
CPU-heavy and the sidecar is capped at 500m CPU. Decision point: give it its
own sidecar endpoint with bumped sidecar limits, or leave it as a hardened
exec in the sandbox container. Recommend sidecar endpoint + raise the sidecar
CPU limit, to keep the "no filesystem exec" invariant clean.pods/exec stays in the sandbox-manager RBAC Role
(deployment/helm/charts/onyx/templates/sandbox-rbac.yaml; K8s can't scope
exec by command), and the dual-ApiClient workaround stays (it's required as
long as any k8s_stream exists). Both shrink to serving only 3 call sites. A
later sandbox-container control endpoint could remove them entirely if desired.sandbox_daemon/contract.py (the daemon imports sandbox_daemon.contract; the
api-server imports the full
onyx.server.features.build.sandbox.image.sandbox_daemon.contract path). This is
the existing shared-contract pattern — both ends stay in sync._signed_sidecar_headers + _sandbox_pod_hosts (Service FQDN, then pod-IP
fallback for out-of-cluster CI), exactly as create_snapshot /
write_files_to_sandbox do today.image/Dockerfile) changes — coordinate the matching app/sandbox
image tag, and note the daemon endpoints also ship in that image.Add typed models to sandbox_daemon/contract.py for each new operation
(session setup, cleanup, list-sessions, exists, list-directory, read-file,
delete-file, write-file, upload-stats, ensure-attachments, regenerate-config,
pptx-preview). Structured request bodies + JSON responses replace every
stdout sentinel.
Add signed endpoints to sandbox_daemon/server.py, each guarded by the
existing _verify_signature pattern. Implement the filesystem logic in the
daemon (it can import and reuse helpers rather than shell out). Extend
/push (or add a sibling) to cover upload_file's tar-extract-with-collision
semantics. Decide pptx-preview placement per the note above.
Add a baked dev-server script to the image, e.g.
/workspace/session-dev-server.sh start|stop <session_path> <port>,
containing today's _build_nextjs_start_script logic plus the PID-kill from
cleanup. Exec it with positional args only — no interpolation. This
removes the last f-string exec and the agent_instructions quoting (AGENTS.md
is now written via the sidecar).
Rewrite the api-server methods to call the sidecar over httpx
(reusing _signed_sidecar_headers / _sandbox_pod_hosts) and to return
typed results. Split setup_session_workspace into: sidecar /session/setup
(mkdir/cp/symlink/AGENTS.md/bun-cache/bun-install) → hardened exec
session-dev-server.sh start. Split cleanup_session_workspace into:
hardened exec session-dev-server.sh stop → sidecar /session/cleanup
(rm -rf). restore_snapshot keeps the hardened dev-server-start exec and
moves _regenerate_session_config to the sidecar.
Delete _parse_ls_output stdout parsing, the sentinel checks, and the
per-method inline scripts as their callers move to typed responses. Keep
_stream_api_client and the 3 residual exec call sites.
Leave RBAC unchanged (pods/exec still needed for the dev-server). Add a
code comment at the residual exec sites noting they are the only remaining
exec users and why.
setup_session_workspace
then session_workspace_exists/list_session_workspaces, write_sandbox_file
read_file, upload_file + get_upload_stats + delete_file,
list_directory, and restore_snapshot. Assert the dev server still comes up
(residual exec path) and the webapp URL responds.session_workspace_exists returns the
correct boolean for both an existing and a non-existing session — the
"EXISTS" in "NOT_EXISTS" bug class — now via a typed response.Prefer the kind integration test as the backbone; add daemon unit tests only for the signature/error-path logic that integration can't cleanly force.