docs/craft/features/streaming/preserve-opencode-sessions.md
Craft sessions use opencode serve as the long-lived agent runtime inside a
sandbox. Onyx persists the BuildSession.opencode_session_id in Postgres so a
later turn can reconnect to the same opencode session instead of creating a
fresh one every message.
That ID alone is not enough after a Kubernetes sandbox sleeps, is evicted, or is recreated. The opencode session rows live inside opencode's data directory in the sandbox filesystem. If the sandbox-level opencode data is not persisted and restored, the Postgres ID points at nothing.
The current implementation persists opencode history as sandbox-global state, separate from normal per-session workspace snapshots.
There are two distinct persistence surfaces.
Normal session snapshots capture only session-local user output:

- outputs/
- attachments/ when present and non-empty

They deliberately do not capture .opencode-data. These archives are created
and restored by the sandbox sidecar through:

- POST /snapshot/create
- POST /snapshot/restore/{session_id}

The sidecar owns local filesystem access. The API server streams the archive
into FileStore through SnapshotManager.
Opencode history is shared by all BuildSessions in a sandbox. In Kubernetes the pod is configured with:
OPENCODE_DATA_HOME=/workspace/opencode-data
The opencode data root is:
/workspace/opencode-data
The durable FileStore object is deterministic per sandbox:
sandbox-snapshots/{tenant_id}/{sandbox_id}/opencode-history.tar.gz
The archive stores that directory under a stable archive root:
.opencode-data/
This keeps opencode persistence separate from session workspaces while avoiding a custom per-session opencode store that would not match opencode's actual sandbox-level data model.
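As an illustration of the deterministic key, here is a minimal sketch. The helper name `opencode_history_key` is hypothetical and not taken from the codebase; only the key layout comes from the text above.

```python
def opencode_history_key(tenant_id: str, sandbox_id: str) -> str:
    """Build the per-sandbox FileStore key for the opencode history archive.

    Hypothetical helper: only the key layout is taken from this document.
    """
    if not tenant_id or not sandbox_id:
        raise ValueError("tenant_id and sandbox_id are required")
    # One object per sandbox, so each new snapshot overwrites the previous one.
    return f"sandbox-snapshots/{tenant_id}/{sandbox_id}/opencode-history.tar.gz"
```

Because the key depends only on the tenant and sandbox, every BuildSession in the same sandbox resolves to the same archive.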
SnapshotManager

backend/onyx/server/features/build/sandbox/snapshot_manager.py
Owns FileStore persistence for both normal session snapshots and sandbox-global
opencode history snapshots. Normal sidecar-created workspace snapshots keep the
sidecar's archive and uncompressed-size checks. Opencode history snapshots use
the deterministic storage path above and are not capped by workspace snapshot
limits. Kubernetes still bounds the pod-local opencode data volume with an
emptyDir.sizeLimit.
backend/onyx/server/features/build/sandbox/image/sandbox_daemon/server.py
The sidecar exposes local filesystem operations as signed HTTP endpoints. It does not upload to S3 and does not know tenant storage credentials.
For opencode history:
- GET /ready returns healthy only after the startup restore path has restored
  or explicitly skipped opencode history. This endpoint is used as the
  restartable init sidecar startup gate, not as the steady-state pod readiness
  signal.
- POST /opencode-history/create returns 204 when the opencode data directory
  has no content, otherwise streams a gzip archive.
- POST /opencode-history/restore accepts a signed, hash-verified archive body
  and restores the opencode data directory locally.
- POST /opencode-history/mark-restored marks a fresh sandbox ready when no
  durable history snapshot exists.

backend/onyx/server/features/build/sandbox/image/sandbox_daemon/opencode_history.py
This module owns the opencode data archive logic:
- […] /workspace/sessions
- Copy opencode/opencode.db, when present, with a SQLite backup so the archive
  carries a coherent DB snapshot even if opencode serve is running
- Extract the archive with the data filter, then replace
  /workspace/opencode-data with the extracted .opencode-data/ root
- If .opencode-data/opencode/opencode.db is present but corrupt after
  restore, clear the restored opencode data directory so opencode serve starts
  fresh
- […] /workspace/managed/.onyx/opencode-history-restored

This is separate from snapshot.py, which now remains focused on normal
session workspace snapshotting.
backend/onyx/server/features/build/sandbox/kubernetes/kubernetes_sandbox_manager.py
The K8s manager coordinates pod lifecycle, sidecar calls, FileStore streaming, and startup restore gating.
It is the only backend currently advertising:
supports_opencode_history_persistence = True
Craft's Kubernetes pod template uses a native restartable init sidecar, so Craft Helm deployments require Kubernetes 1.33 or newer. This is enforced at deployment/render time by the chart, not by a runtime backend version check.
When a Kubernetes sandbox pod starts:
1. Kubernetes starts the sidecar init container. Its health
   endpoint is available, but its startup endpoint stays blocked.
2. If no durable history snapshot exists, the manager calls
   /opencode-history/mark-restored.
3. If a durable history snapshot exists, the manager streams it to
   /opencode-history/restore.
4. The sidecar's /ready endpoint succeeds, which releases the restartable init
   sidecar startup gate.
5. Kubernetes starts the sandbox app container. Its entrypoint runs
   opencode serve with XDG_DATA_HOME pointed at
   /workspace/opencode-data.
6. The manager waits for opencode serve readiness.

The important property is that opencode serve never starts before restore
has completed or been explicitly skipped.
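The restore-or-skip decision that releases the startup gate can be sketched as below. The `Sidecar` protocol and `release_startup_gate` are hypothetical names for illustration; only the three endpoints they stand in for come from this document.

```python
from typing import Optional, Protocol


class Sidecar(Protocol):
    """Hypothetical client for the sidecar endpoints named in this doc."""

    def restore_history(self, archive: bytes) -> None: ...  # POST /opencode-history/restore
    def mark_restored(self) -> None: ...                    # POST /opencode-history/mark-restored
    def is_ready(self) -> bool: ...                         # GET /ready


def release_startup_gate(sidecar: Sidecar, durable_archive: Optional[bytes]) -> None:
    """Restore history, or explicitly skip it, then confirm the gate opened."""
    if durable_archive is None:
        # Fresh sandbox: no durable snapshot, so skip restore explicitly.
        sidecar.mark_restored()
    else:
        sidecar.restore_history(durable_archive)
    if not sidecar.is_ready():
        # The app container must not start while the gate is still closed.
        raise RuntimeError("sidecar /ready is still blocked after restore or skip")
```

Both branches end in the same state, which is why the app container only needs to wait on `/ready` and does not need to know whether a snapshot existed.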
When the K8s manager finds an already healthy sandbox pod, it reuses that pod without re-running startup history restore.
Opencode history snapshots are created before a sandbox sleeps and during best-effort recovery.
1. The manager calls /opencode-history/create.
2. If the opencode data directory has no content, the sidecar returns 204.
3. Otherwise the sidecar makes a sqlite3.Connection.backup() copy, creates a
   tar.gz archive, and streams it back.
4. The manager streams the archive into SnapshotManager.
5. SnapshotManager stores it at the stable sandbox-level FileStore key.

If the sidecar returns 204 for an empty live store, the manager preserves any
existing durable history archive. That is important for idle/recovery paths: a
transient live empty/missing DB should not destroy the last known good history.
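A minimal sketch of the 204 rule follows, using a plain dict as a stand-in for FileStore. The function name is hypothetical, and treating 200 as the success status for a streamed archive is an assumption; this document only specifies the 204 case.

```python
from typing import Dict, Optional


def persist_history_snapshot(
    store: Dict[str, bytes], key: str, status: int, body: Optional[bytes]
) -> bool:
    """Store a new archive; on 204 leave any existing archive untouched.

    Returns True when the durable archive was replaced.
    """
    if status == 204:
        # Empty live store: never overwrite or delete the last good archive.
        return False
    if status != 200 or not body:  # 200 as the success code is an assumption
        raise RuntimeError(f"unexpected sidecar response: {status}")
    store[key] = body
    return True
```

The important asymmetry is that an empty live store is treated as "no new information" and not as "history was deleted".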
The prompt path is intentionally optimistic.
1. Onyx reads BuildSession.opencode_session_id when one has been
   persisted.
2. _ensure_opencode_session_id mints and persists an opencode
   session ID if the row has none.
3. yield_sandbox_events calls sandbox_manager.send_message with the saved
   ID and an on_opencode_session_resolved callback.
4. _send_message_via_serve calls OpencodeServeClient.ensure_session.
5. If the session still exists, opencode returns 200 and the same ID is reused.
6. If opencode returns 404, Onyx creates a fresh opencode session and invokes
   the callback so the BuildSession row is updated.

This means a restored sandbox with a missing opencode ID does not fail the user turn. It starts a new opencode session and records the new ID. The tradeoff is that opencode itself does not yet receive prior chat history in that newly created session. That replay behavior is intentionally out of scope for this change.
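The optimistic reuse-or-replace logic can be sketched as follows. This is not the real `OpencodeServeClient.ensure_session`: the `ServeClient` protocol, its method names, and `resolve_opencode_session` are hypothetical, and raising on any status other than 200 or 404 is an assumption.

```python
from typing import Callable, Protocol


class ServeClient(Protocol):
    """Hypothetical minimal view of an opencode serve client."""

    def get_session_status(self, session_id: str) -> int: ...
    def create_session(self) -> str: ...


def resolve_opencode_session(
    client: ServeClient, saved_id: str, on_resolved: Callable[[str], None]
) -> str:
    """Reuse the saved session when it exists, otherwise create a new one."""
    status = client.get_session_status(saved_id)
    if status == 200:
        return saved_id  # session survived; no DB update needed
    if status == 404:
        new_id = client.create_session()
        on_resolved(new_id)  # lets the caller persist the replacement ID
        return new_id
    # Assumption: other statuses are real failures and fail the turn.
    raise RuntimeError(f"opencode session lookup failed with status {status}")
```

Passing the callback in keeps the transport layer free of database access: it reports the new ID and the caller decides how to persist it.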
Deleting a BuildSession deletes Onyx's durable session record. When the sandbox
is running and the row has an opencode_session_id, Onyx also makes a
best-effort request to delete that live opencode session. That cleanup is an
optimization only: failures are logged and do not block deleting the Onyx row.
Opencode history remains sandbox-global implementation data, so session delete does not prune durable opencode history archives. If opencode still has a row for the deleted BuildSession, that row is orphaned and no longer reachable through Onyx.
- SessionManager.delete_session acquires the session prompt slot.

For a sleeping or otherwise not-running sandbox, deletion does not try to edit or validate opencode history. The Onyx row is removed, so any stale opencode record left in a durable history archive is orphaned implementation data.
The sandbox cleanup task handles idle running sandboxes.
When Onyx detects an unhealthy running sandbox and needs to terminate/recover it, the lifecycle code attempts a best-effort opencode history snapshot before termination.
This path is best-effort because the sandbox may already be partially dead. A failure is logged but does not block recovery forever.
On reprovision, the normal restore flow restores the last durable opencode history snapshot if one exists.
User-requested sandbox reset is a destructive "start fresh" operation.
- The sandbox row is marked TERMINATED in the caller-owned transaction.

The durable FileStore delete is external to the DB transaction. If termination fails after history deletion succeeds, the API reports reset failure and rolls back DB state, but the durable history object is already gone. This preserves the "start fresh" invariant on the next successful provision.
If the sandbox row is already TERMINATED, reset has no live pod or DB status
transition left to protect. In that case the durable history delete is
best-effort: failures are logged and the reset still returns success. A later
reset can retry the delete once FileStore recovers.
This is intentionally different from idle sleep. Sleep preserves history; reset removes it.
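The reset ordering can be sketched roughly as below. The function and its callable parameters are hypothetical. Treating a history-delete failure on a live sandbox as a reset failure is an assumption, inferred from the contrast with the best-effort TERMINATED case.

```python
from typing import Callable


def reset_sandbox(
    status: str,
    delete_history: Callable[[], None],
    terminate: Callable[[], None],
    log: Callable[[str], None],
) -> str:
    """Delete durable history, then terminate; return the final row status."""
    if status == "TERMINATED":
        # Nothing live to protect: the delete is best-effort and retryable.
        try:
            delete_history()
        except Exception as exc:
            log(f"history delete failed, reset still succeeds: {exc}")
        return "TERMINATED"
    # Live sandbox: delete history first (assumed to fail the reset on error).
    delete_history()
    # If this raises, the caller rolls back DB state; history is already gone.
    terminate()
    return "TERMINATED"
```

Deleting history before terminating means a failed reset can never be followed by a provision that restores the old history.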
The normal snapshot loop iterates session directories and stores each session's workspace. That is the right model for outputs and attachments.
Opencode history is different:
So the implementation reuses the same high-level snapshot infrastructure
(SnapshotManager, FileStore, signed sidecar streaming), but keeps opencode
history as a sandbox-level archive with its own create/restore endpoints and
policy.
- Normal session snapshots do not capture .opencode-data.
- […] /workspace/sessions tree.
- Opencode history archives store the data directory under the
  .opencode-data/ archive root.
- A corrupt opencode/opencode.db is discarded in favor of a fresh
  opencode data directory.
- opencode serve starts only after startup history restore is complete.
- Errors other than 404 still fail the turn.
- […] ENABLE_CRAFT=true is paired with a
  non-Kubernetes sandbox backend.

If a restored sandbox does not contain the saved opencode ID, Onyx now mints a new opencode session and persists it. That avoids blocking the user, but the new opencode session does not yet contain prior chat history.
The planned follow-up is to detect this replacement-session case and replay the saved BuildMessage history into opencode before sending the next user prompt. That should live above the low-level snapshot/restore path. The snapshot layer should continue to restore the DB when possible and stay storage-focused.
- backend/onyx/server/features/build/sandbox/image/sandbox_daemon/opencode_history.py
- backend/onyx/server/features/build/sandbox/image/sandbox_daemon/server.py
- backend/onyx/server/features/build/sandbox/snapshot_manager.py
- backend/onyx/server/features/build/sandbox/kubernetes/kubernetes_sandbox_manager.py
- backend/onyx/server/features/build/sandbox/opencode/serve_client.py
- backend/onyx/server/features/build/sandbox/serve_transport.py
- backend/onyx/server/features/build/session/streaming.py
- backend/onyx/server/features/build/session/manager.py