docs/craft/issues/opencode-serve-deploy-gotchas.md
Audience: future engineers (and agents) deploying or debugging the opencode-serve transport on a real Kubernetes cluster. Captures the failure modes hit during the first production-cluster rollout and the institutional knowledge about how the opencode-serve provider chain wires up at runtime.
Status: the rollout succeeded after the workarounds below. The underlying defects (RBAC gap, mutable-tag image cache, env-var refresh race) are real and should be fixed in source rather than patched per environment.
secrets not in api-server-roleSymptom:
Session creation failed: (403) Forbidden
secrets "sandbox-<id>-opencode-auth" is forbidden:
User "system:serviceaccount:onyx:onyx-workload-access" cannot
get resource "secrets" in API group "" in the namespace "onyx-sandboxes"
Root cause: the api server's Role in the sandbox namespace was
authored before the opencode-serve transport landed. It only granted
verbs on pods, pods/exec, and services. The new transport adds a
per-pod K8s Secret (sandbox-<id>-opencode-auth) holding the
OPENCODE_CONFIG_CONTENT and OPENCODE_SERVER_PASSWORD env values; the
api server create/get/replace/deletes it during provision and
cleanup.
Fix: add the verbs the api server actually uses to whatever defines the Role in your deployment system.
- apiGroups: [""]
resources: ["secrets"]
verbs: ["create", "get", "update", "delete"]
These are exactly the four verbs the api server exercises — and no more.
Map each to the _core_api. call that needs it:
call (kubernetes_sandbox_manager.py) | HTTP method | RBAC verb |
|---|---|---|
create_namespaced_secret | POST | create |
read_namespaced_secret | GET | get |
replace_namespaced_secret | PUT | update |
delete_namespaced_secret | DELETE | delete |
Watch out: replace_namespaced_secret is a PUT, which maps to the
update verb — not patch. If you grant patch instead of
update, the happy path still works but the 409-race re-provision
branch (see §1.3 — the path that calls replace_namespaced_secret after
a create conflict) will intermittently 403. list/watch are not used
on secrets at all.
backend/onyx/server/features/build/sandbox/kubernetes/kubernetes_sandbox_manager.py
is the source of truth for which k8s resources the api server touches —
grep for _core_api. method calls if you need to re-derive the verb set.
:latest tag + cached node imagesSymptom: after pushing a new sandbox image with the :latest tag,
freshly-provisioned sandbox pods kept running the old image.
opencode-serve was not listening on :4096; the container appeared to
be running the prior placeholder behavior.
Root cause: sandbox pods are created with
imagePullPolicy: IfNotPresent (hard-coded in
KubernetesSandboxManager._create_sandbox_pod, not overridable via helm
values). When the configured image reference is the mutable :latest,
the kubelet sees an image with that tag already cached on the node and
skips the pull. Chart-level pullPolicy: Always does not apply —
sandbox pods are created imperatively by the api server, not by the
chart.
Workaround: point the api server's SANDBOX_CONTAINER_IMAGE env at
a specific, immutable tag (e.g. onyxdotapp/sandbox:v0.1.45) for any
given deploy. A new tag forces the kubelet to pull, even under
IfNotPresent.
Source-level fix (deferred): change the pod spec to
imagePullPolicy: Always in _create_sandbox_pod. Eliminates the
per-deploy tag-bump dance.
OPENCODE_CONFIG_CONTENT env doesn't refresh on Secret updateSymptom: opencode-serve kept reporting Model not found: anthropic/claude-opus-4-7 even after:
build-mode-anthropic admin LLM provider,kubectl get secret … -o jsonpath='{.data.config}' | base64 -d to
show both providers in the new content).But kubectl exec sandbox-… -- env | grep OPENCODE_CONFIG_CONTENT
still showed the old, single-provider value.
Root cause: Kubernetes does not propagate Secret updates to env
vars on running pods. Env from secretKeyRef is a snapshot taken at
container start. Only pod deletion (or recreate-replace at the pod
level) forces the kubelet to read the current Secret.
Symptom amplification: opencode-serve reads OPENCODE_CONFIG_CONTENT
directly at startup and uses that env value as the source of truth
for its provider registry. Stale env → stale opencode provider list.
_provision_opencode_secret rebuilds the Secret correctly, but no code
path in the api server deletes the pod afterward.
Recovery: kubectl delete pod sandbox-<id> after any admin
LLM-provider change that needs to reach an already-provisioned sandbox.
The api server's normal reconciliation (health check fails → terminate
→ re-provision) recreates the pod, which loads fresh env from the
current Secret.
Source-level fix (deferred): when the api server detects that
all_llm_configs has changed vs. the deployed sandbox, it should
delete the pod (forcing env refresh), not just replace_namespaced_secret.
_get_all_llm_configs only picks up build-mode-* providersSymptom: admin had an Anthropic provider configured with
name: "anthropic". opencode prompts for providerID: anthropic
returned ProviderModelNotFoundError. The sandbox's
OPENCODE_CONFIG_CONTENT only contained openai.
Root cause: fetch_all_build_mode_llm_providers in
backend/onyx/server/features/build/db/build_session.py filters with
LLMProviderModel.name.like("build-mode-%"). Providers without the
build-mode- name prefix are silently dropped during sandbox
provisioning.
Fix: add a second LLMProvider row with name = "build-mode-anthropic"
(the provider type stays anthropic, same API key, same visible
models). Same pattern for any other provider you want preloaded.
Why the prefix exists: it discriminates the craft-mode LLM provider set from the regular chat product's providers so API keys and model visibility don't bleed across products.
_get_all_llm_configs de-dups by provider type, not by rowWhen the user's BuildSession has agent_provider="anthropic", the
default llm_config IS anthropic. _get_all_llm_configs then does
roughly:
configs = [default]
seen_providers = {default.provider} # {"anthropic"}
for provider in fetch_all_build_mode_llm_providers(db):
if provider.provider in seen_providers: # build-mode-anthropic → "anthropic" → skip
continue
…
So the multi-provider list collapses back to [anthropic_config] if
the only other build-mode-* row is also anthropic. To have a sandbox
preloaded with two providers, you need two build-mode-* rows with
different provider types.
51b780b9f8 — PodEventBus.closed property + eviction in
_get_or_create_event_bus. Mentioned here so future readers
inspecting the bus lifecycle understand the existing safety net: when
a bus exhausts its reconnect budget (20 consecutive failures), it
self-closes; the next _get_or_create_event_bus call evicts it and
builds a fresh one with the current Secret's password.
ensure_session (already shipped)e6e8a109e9 — 3-attempt retry with linear backoff on
httpx.ConnectError/RemoteProtocolError in
_http_with_cold_pod_retry. Subsequently tightened in 28fd724314 —
POST /session retries only on ConnectError (not
RemoteProtocolError) to avoid an orphan-session leak.
This is the load-bearing institutional knowledge. Without it the bugs in §1 look like a random pile of subsystems; with it they're obvious in advance.
llm_provider rows. The name field controls
whether the row participates in craft (build-mode-* pattern).
provider is the type (anthropic, openai, openrouter, …).agent_provider / agent_model record
the user's selection. Defaults come from fetch_default_llm_model().KubernetesSandboxManager.provision):
_get_llm_config(requested_provider_type, requested_model_name)
resolves the default for this sandbox. Tries
fetch_llm_provider_by_type_for_build_mode(provider_type) first
(which prefers build-mode-{type} then falls back to any provider
of that type). Returns an LLMProviderConfig._get_all_llm_configs(default=llm_config) builds the full list:
[default] + every other build-mode-* provider whose provider
type isn't already in seen_providers and that has at least one
is_visible=True model.build_multi_provider_opencode_config(providers, default_provider, default_model, disabled_tools) produces the opencode.json shape:
{$schema, model, provider: {<type>: {options.apiKey, models}}, enabled_providers, permission}.json.dumps'd and written to a per-pod K8s Secret
(sandbox-<id>-opencode-auth, key config) alongside the per-pod
password (HTTP Basic auth for opencode-serve).secretKeyRef env
entries: OPENCODE_CONFIG_CONTENT and OPENCODE_SERVER_PASSWORD.
k8s does NOT propagate Secret updates to running pod envs — see
§1.3.opencode serve --hostname 0.0.0.0 --port "$OPENCODE_SERVE_PORT" --print-logs in a while true restart loop.
It does NOT materialize OPENCODE_CONFIG_CONTENT to a file — the
env var is opencode's primary config source (opencode also probes
filesystem paths but the env path is what craft uses).service=provider status=started
service=provider providerID=<X> found
service=provider status=completed
found line for a provider means opencode silently
dropped it from the registry._post_prompt_async
sets body["model"] = {"providerID": …, "modelID": …} when both are
set on the BuildSession. opencode looks up providerID/modelID in
its merged registry; if the providerID isn't registered →
ProviderModelNotFoundError.OPENCODE_SERVER_USERNAME = "opencode" (constant in
backend/onyx/server/features/build/configs.py). Not "onyx" —
this tripped up curl-based debugging.OPENCODE_SERVER_PASSWORD env (mounted from
the per-pod Secret). Both the api server (HTTP Basic) and any
in-pod curl debugging need this exact username.| Error / log line | Meaning |
|---|---|
secrets … is forbidden … namespace "onyx-sandboxes" | §1.1 — api-server Role missing secrets verbs |
[Errno 111] Connection refused on POST /session | opencode-serve not bound to :4096. Cold pod, crashloop, or process death |
ProviderModelNotFoundError w/ bundled-model suggestions | Provider not registered in opencode (silent drop from env config), or config not loaded |
Sandbox … has status provisioning and is being created by another request | Intentional guard. Two concurrent requests on the same user's mid-provision sandbox |
| Sandbox pod shows old behavior despite new image push | Cached :latest digest on the node + IfNotPresent policy (§1.2) |
opencode /event stream did not become ready | Bus reader failed to subscribe before the per-turn deadline. Causes: opencode-serve not bound yet (cold), or auth mismatch between bus's cached password and current Secret (bus will self-close and rebuild on next prompt — see §1.6) |
| opencode-serve restarts in tight loop, exit 143 | SIGTERM propagated from the entrypoint trap. Almost always operator-induced |
What providers does THIS sandbox actually have loaded?
SBX=$(kubectl -n onyx-sandboxes get pods -o name | head -1 | sed 's|pod/||')
kubectl -n onyx-sandboxes exec "$SBX" -c sandbox -- sh -c 'echo "$OPENCODE_CONFIG_CONTENT"' \
| python3 -c 'import sys, json; d=json.load(sys.stdin); print("enabled:", d.get("enabled_providers")); print("keys:", list(d.get("provider",{}).keys()))'
What does opencode-serve think it loaded? Tail the sandbox
container logs and look for service=provider providerID=<X> found
lines after the first prompt. If a provider is in the env var but
missing from this log, opencode silently dropped it (usually a JSON
shape problem or missing API key).
Test the opencode API directly (auth user is opencode, NOT onyx):
kubectl -n onyx-sandboxes exec sandbox-<id> -c sandbox -- sh -c '
curl -s -u "opencode:$OPENCODE_SERVER_PASSWORD" http://localhost:4096/doc \
| python3 -m json.tool | head -20
'
Tail opencode logs across pods (with stern):
stern -n onyx-sandboxes sandbox -c sandbox
name matches build-mode-%:
SELECT id, name, provider, default_model_name FROM llm_provider;
build-mode-* row pointing at
the same API key). The provider field stays as the type.model_configuration has is_visible=true for
that provider — silently skipped otherwise.service=provider providerID=<X> found after
the first prompt. If missing, the provider got dropped — recheck
the JSON shape in OPENCODE_CONFIG_CONTENT.kubectl -n onyx-sandboxes get pod sandbox-<id> \
-o jsonpath='{.status.containerStatuses[?(@.name=="sandbox")].imageID}'
SANDBOX_CONTAINER_IMAGE (api server env) at a
specific, immutable version tag and re-roll the api server +
delete the affected sandbox pod.The api server's Role in the sandbox namespace is out of date — see
§1.1. Cross-reference current Role rules against the k8s API calls
made from kubernetes_sandbox_manager.py (grep for _core_api. /
_rbac_api. method names) to derive the verb set the api server
actually needs.
| Item | Why it matters | Where to land it |
|---|---|---|
imagePullPolicy: Always on sandbox pods | Eliminates the :latest cache trap so deploys don't need a per-version tag bump | kubernetes_sandbox_manager._create_sandbox_pod |
| Recreate pod on Secret content change | Fixes §1.3 — env doesn't refresh on Secret update | _provision_opencode_secret or the provisioning caller |
secrets verbs in the chart-managed sandbox-namespace Role | Stops the §1.1 403 from recurring on every fresh deploy | Onyx helm chart's sandbox-rbac template |
Pre-seed build-mode-* providers in admin onboarding | Avoid the silent-skip filter trap in §1.4 | Onyx admin / setup wizard |
| Surface "sandbox provisioning" via 409 + Retry-After | Instead of RuntimeError → JSON dump, return a clean wait/retry signal the FE can poll on | session/manager._stream_cli_agent_response |
Wait for opencode-serve /doc before reporting RUNNING | Closes the "k8s-ready but opencode-not-bound" window where the first prompt races a cold opencode | KubernetesSandboxManager.provision (landed alongside this doc) |
Captured during the first production-cluster rollout of the opencode-serve transport. The imperative recovery commands above were validated against a single-tenant cluster; the source-level fixes are tracked in §4.