docs/agents/codex-app-server-provider-notes.md
These notes track the planned Promptfoo integration for the Codex app-server protocol. They are intentionally implementation-facing: keep them current as the provider, docs, examples, and verification expand.
For the broader coding-agent provider taxonomy, see `coding-agent-provider-taxonomy.md`.
Add an experimental Promptfoo provider that drives `codex app-server` directly. The provider should complement, not replace, the existing OpenAI Codex SDK provider.
Primary provider IDs:

- `openai:codex-app-server`
- `openai:codex-app-server:<model>`
- `openai:codex-desktop`
- `openai:codex-desktop:<model>`

Optional top-level aliases may be added after the OpenAI-scoped provider is stable:

- `codex:app-server`
- `codex:desktop`

Schema inspection commands:

```bash
/Applications/Codex.app/Contents/Resources/codex app-server --help
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json
```

The current local schema inspection was generated from codex-cli 0.118.0.
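To compare provider config keys against the generated schema by hand, something like the following can walk the JSON output directory. This is a sketch: the assumption that `generate-json-schema` emits `.json` files with `definitions` or `$defs` at the top level is ours, not the schema generator's documented contract.

```ts
// Hypothetical schema audit helper: list definition names found in the
// generated JSON schema files so config keys can be compared by hand.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const dir = "/tmp/codex-app-server-schema/json";
for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const schema = JSON.parse(readFileSync(join(dir, file), "utf8"));
  // Top-level layout varies by generator version; print whatever is present.
  console.log(file, Object.keys(schema.definitions ?? schema.$defs ?? schema));
}
```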
Transport:

- `stdio://` default, JSONL messages.
- `ws://IP:PORT` experimental, one JSON-RPC message per WebSocket text frame.

Handshake:

- `initialize` with Promptfoo client metadata.
- `initialized` notification.
- Each turn completes on `turn/completed`.
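A minimal stdio handshake sketch, assuming standard JSON-RPC 2.0 framing with one message per line; the `initialize` params shape is an assumption and the generated schema is the source of truth:

```ts
// Spawn the app server and perform the initialize/initialized handshake over
// JSONL stdio. Field names beyond standard JSON-RPC 2.0 are assumptions.
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

const server = spawn("codex", ["app-server", "--listen", "stdio://"]);
const send = (msg: object) => server.stdin.write(JSON.stringify(msg) + "\n");

// 1. initialize with Promptfoo client metadata (params shape assumed).
send({
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: { clientInfo: { name: "promptfoo", version: "dev" } },
});

createInterface({ input: server.stdout }).on("line", (line) => {
  const msg = JSON.parse(line);
  if (msg.id === 1) {
    // 2. initialized notification; thread/turn requests can follow.
    send({ jsonrpc: "2.0", method: "initialized" });
  }
});
```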
Core client requests:

- `initialize`
- `thread/start`
- `thread/resume`
- `thread/archive`
- `thread/unsubscribe`
- `thread/read`
- `turn/start`
- `turn/steer`
- `turn/interrupt`
- `review/start`
- `model/list`
- `skills/list`
- `plugin/list`
- `plugin/read`
- `app/list`

High-risk client requests that should not be exposed casually:
- `fs/writeFile`
- `fs/remove`
- `fs/copy`
- `config/value/write`
- `config/batchWrite`
- `plugin/install`
- `plugin/uninstall`
- `command/exec`

Core server notifications:
- `thread/started`
- `thread/status/changed`
- `turn/started`
- `turn/completed`
- `item/started`
- `item/completed`
- `item/agentMessage/delta`
- `item/commandExecution/outputDelta`
- `item/fileChange/outputDelta`
- `item/mcpToolCall/progress`
- `serverRequest/resolved`
- `thread/tokenUsage/updated`
- `error`

Core server requests requiring deterministic Promptfoo responses:
- `item/commandExecution/requestApproval`
- `item/fileChange/requestApproval`
- `item/permissions/requestApproval`
- `item/tool/requestUserInput`
- `mcpServer/elicitation/request`
- `item/tool/call`
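For repeatable evals, each of these requests needs a fixed answer instead of a human in the loop. A decline-by-default sketch follows; the method names come from the list above, but the result payload shapes are assumptions and the generated schema defines the real decision enums:

```ts
// Decline-by-default handling for app-server requests (illustrative payloads).
type ServerRequest = { id: number; method: string; params?: unknown };

function answerServerRequest(req: ServerRequest) {
  switch (req.method) {
    case "item/commandExecution/requestApproval":
    case "item/fileChange/requestApproval":
    case "item/permissions/requestApproval":
      return { jsonrpc: "2.0", id: req.id, result: { decision: "denied" } }; // assumed shape
    case "item/tool/requestUserInput":
    case "mcpServer/elicitation/request":
      return { jsonrpc: "2.0", id: req.id, result: { declined: true } }; // assumed shape
    default:
      // Unknown request: refuse rather than guess at side effects.
      return { jsonrpc: "2.0", id: req.id, error: { code: -32601, message: `unhandled ${req.method}` } };
  }
}
```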
Promptfoo prompt strings remain the default. The provider should also accept a JSON array of Codex input items:

```json
[
  { "type": "text", "text": "Review this project" },
  { "type": "local_image", "path": "/absolute/path/to/screenshot.png" },
  { "type": "skill", "name": "skill-creator", "path": "/absolute/path/SKILL.md" }
]
```
Supported input item types for the first implementation:

- `text`
- `local_image`, mapped to app-server `inputImage`
- `skill`

Unknown prompt JSON should be treated as plain text instead of throwing.
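A sketch of that parsing rule: valid input-item arrays pass through, and anything else, including malformed JSON, degrades to a single text item:

```ts
// Prompt parsing per the rules above. The item types mirror the first
// implementation: text, local_image (mapped to inputImage), and skill.
type CodexInputItem =
  | { type: "text"; text: string }
  | { type: "local_image"; path: string }
  | { type: "skill"; name: string; path: string };

function parsePromptItems(prompt: string): CodexInputItem[] {
  try {
    const parsed: unknown = JSON.parse(prompt);
    if (
      Array.isArray(parsed) &&
      parsed.every((item) =>
        ["text", "local_image", "skill"].includes((item as { type?: string })?.type ?? ""),
      )
    ) {
      return parsed as CodexInputItem[];
    }
  } catch {
    // Not JSON at all: fall through to plain text.
  }
  // Unknown or non-array JSON is treated as plain text instead of throwing.
  return [{ type: "text", text: prompt }];
}
```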
The provider response should include:

- `output`: final assistant text, assembled from `item/agentMessage/delta` and completed `agentMessage` items.
- `sessionId`: thread id.
- `raw`: serialized protocol-level turn summary and selected notifications.
- `metadata.codexAppServer`: thread id, turn id, model, cwd, sandbox, approvals, server requests, item counts, command/file/tool trajectories, and app-server protocol data useful for debugging.
- `metadata.skillCalls` / `metadata.attemptedSkillCalls`: heuristic skill usage where available.
- `tokenUsage`: from `thread/tokenUsage/updated` if emitted.
- `cost`: estimated from Promptfoo's Codex pricing table when the model is known.
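A sketch of the aggregation rule for `output`: collect `item/agentMessage/delta` fragments, but let a completed `agentMessage` item win so progress text is not concatenated into the final answer. Notification payload shapes here are assumptions:

```ts
// Turn-level aggregation state and the final provider response mapping.
interface TurnAggregate {
  threadId?: string;
  deltas: string[];                       // from item/agentMessage/delta
  finalMessage?: string;                  // from a completed agentMessage item
  tokenUsage?: Record<string, number>;    // from thread/tokenUsage/updated
}

function toProviderResponse(turn: TurnAggregate) {
  return {
    output: turn.finalMessage ?? turn.deltas.join(""),
    sessionId: turn.threadId,
    tokenUsage: turn.tokenUsage, // only present when the server emitted it
  };
}
```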
Provider-level config should be strict. Prompt-level merged config should strip unknown keys so generic Promptfoo prompt config does not break rows.

Core config:

- `apiKey`
- `base_url`
- `working_dir`
- `additional_directories`
- `skip_git_repo_check`
- `codex_path_override`
- `model`
- `model_provider`
- `service_tier`
- `sandbox_mode`
- `sandbox_policy`
- `approval_policy`
- `approvals_reviewer`
- `model_reasoning_effort`
- `reasoning_summary`
- `personality`
- `output_schema`
- `thread_id`
- `persist_threads`
- `thread_pool_size`
- `ephemeral`
- `persist_extended_history`
- `experimental_raw_events`
- `experimental_api`
- `cli_config`
- `cli_env`
- `inherit_process_env`
- `reuse_server`
- `deep_tracing`
- `request_timeout_ms`
- `startup_timeout_ms`
- `server_request_policy`

Default stance should favor repeatable evals over convenience:
- `approval_policy: never`
- `sandbox_mode: read-only`
- `network_access_enabled: false`
- `ephemeral: true`
- `reuse_server: true` unless `deep_tracing` is enabled
- `inherit_process_env: false`

Rationale: these defaults keep rows deterministic, sandboxed, and isolated from local machine state, so eval results are repeatable.
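Expressed as a provider entry, a sketch using the key names from the core config list above:

```ts
// Recommended default stance for eval runs; values mirror the bullets above.
const codexAppServerProvider = {
  id: "openai:codex-app-server", // a :<model> suffix is optional
  config: {
    approval_policy: "never",
    sandbox_mode: "read-only",
    network_access_enabled: false,
    ephemeral: true,
    reuse_server: true, // flipped off automatically when deep_tracing is on
    inherit_process_env: false,
  },
};
```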
Implementation areas:

- Stdio JSON-RPC client over `codex app-server`.
- Provider lifecycle: register and unregister with `providerRegistry`; use a dedicated server when `deep_tracing` is enabled.
- Thread and turn execution.
- Streaming aggregation keyed by `turnId`.
- Server request handling, including `tool/requestUserInput`.
- Tracing: wrap `callApi` in `withGenAISpan`; inject OTEL env when `deep_tracing` is enabled (see the sketch after this list).
- Docs and examples.
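For the tracing item, a sketch of the intended wrapping. `withGenAISpan` is the helper named above; its signature here is an assumption for illustration:

```ts
// Assumed helper signature, declared only so the sketch type-checks.
declare function withGenAISpan<T>(name: string, fn: () => Promise<T>): Promise<T>;

async function callApi(prompt: string): Promise<{ output: string }> {
  return withGenAISpan("openai.codex-app-server", async () => {
    // Run the initialize/thread/turn flow and aggregate the turn here.
    return { output: "" };
  });
}
```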
Verification:

- Provider implemented in `src/providers/openai/codex-app-server.ts`.
- IDs registered: `openai:codex-app-server`, `openai:codex-app-server:<model>`, `openai:codex-desktop`, `openai:codex-desktop:<model>`.
- Server launched with `codex app-server --listen stdio://`.
- Protocol flow exercised: `initialize`, `initialized`, `thread/start`, `thread/resume`, `turn/start`, `turn/completed`, plus the `thread/unsubscribe` / `thread/archive` cleanup modes.
- Example config: `approval_policy: never`, `sandbox_mode: read-only`, `ephemeral: true`, `thread_cleanup: unsubscribe`, `inherit_process_env: true`.
- Server requests answered deterministically: decline command approvals, decline file changes, decline user input.
- `agentMessage` fallback/preference and `thread/tokenUsage/updated` handling covered.
- Unit tests in `test/providers/openai-codex-app-server.test.ts` and `test/providers/index.test.ts`, run with:
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - `npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false`
  - `npm run tsc -- --pretty false`
- `thread/resume` and `thread/unsubscribe` cleanup covered.
- Live eval (with `sandbox_mode: read-only`) if credentials/login are available: `npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache -o /tmp/promptfoo-codex-app-server-example.json`; checked `sessionId`, token usage, item counts, thread id, turn id, and structured JSON output.
- Site build: `cd site && SKIP_OG_GENERATION=true npm run build`.

Dogfooding `examples/openai-codex-app-server/review-diff/promptfooconfig.yaml` found four actionable provider issues:
- The `raw` response payload serialized unsanitized protocol items; `raw` now contains sanitized thread, turn, token usage, notifications, and item metadata.
- Output assembly now prefers the completed `agentMessage`, which avoids concatenating progress messages with the final structured review output.
- `execCommandApproval` / `applyPatchApproval` requests identify the active thread with `conversationId` and expect the legacy review decisions `approved`, `approved_for_session`, `denied`, and `abort`; the provider now resolves approvals by `conversationId` using those decisions.
- Non-persistent threads now receive `thread/unsubscribe` before being removed.

Each fix was re-verified with `npm run tsc -- --pretty false` and `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`.

Follow-up fixes and observations from the same pass:

- Runs that spawn a dedicated server per call (`reuse_server: false` or `deep_tracing`) were exercised as well.
- Concurrent rows could race while `thread/start` was still pending, creating duplicate persistent threads and leaking the earlier one; `thread/start` / `thread/resume` callers now share the same pending thread handle, so only one `thread/start` is sent and both turns use the shared thread.
- Handling of `ephemeral`, `experimental_raw_events`, and `persist_extended_history` was corrected.
- One run hit app-server network failures (`Network is unreachable`, `Reconnecting... 2/5`), which the provider surfaced as a clean provider error.
- `prepareEnvironment` treatment of `CODEX_HOME`, `HOME`, and `USERPROFILE` (including `cli_env.HOME`) and skill-call metadata detection were fixed.

`npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-review.json` now returns `{"comments":[],"summary":"No actionable findings; TypeScript and focused provider tests passed."}`.

Full check results:

- `npm run f`: pass with existing complexity warnings only; no formatting changes needed.
- `npm run tsc -- --pretty false`: pass.
- `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`: pass, 24 provider tests.
- `npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false`: pass, 2 registry tests.
- `npm run l`: pass with existing complexity warnings only.
- `cd site && SKIP_OG_GENERATION=true npm run build`: pass.
- `npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-example.json`: pass.

Required mocked unit tests:
- `skip_git_repo_check` handling.
- `cli_env` passthrough.
- Handshake sends `initialize`, `initialized`, `thread/start`, `turn/start`.
- `thread_id` uses `thread/resume`.
- `persist_threads` reuses the cached thread and serializes turns.
- `agentMessage` fallback works when deltas are missing.
- Token usage is read from `thread/tokenUsage/updated`.
- Abort triggers `turn/interrupt` and returns an aborted error.
- Cleanup kills the child process and unregisters the provider.
- `deep_tracing` injects OTEL env and disables reuse/thread persistence.
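A vitest sketch of one such mocked test, with the `spawn` mock implementation set in `beforeEach` per the `test/AGENTS.md` isolation rules; `fakeAppServerProcess` is a hypothetical harness helper, not a real export:

```ts
import { beforeEach, describe, expect, it, vi } from "vitest";

const spawnMock = vi.fn();
vi.mock("node:child_process", () => ({
  spawn: (...args: unknown[]) => spawnMock(...args),
}));

// Hypothetical helper that replays scripted JSONL responses for the handshake.
declare function fakeAppServerProcess(responses: string[]): unknown;

describe("OpenAICodexAppServerProvider", () => {
  beforeEach(() => {
    spawnMock.mockReset();
    // Implementation assigned per test, not at module scope.
    spawnMock.mockImplementation(() => fakeAppServerProcess([]));
  });

  it("spawns codex app-server once per provider instance", () => {
    // ...construct the provider and run callApi against the mocked process...
    expect(spawnMock).toHaveBeenCalledTimes(0); // placeholder assertion for the sketch
  });
});
```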
Required docs/examples checks:

- `npm run local -- eval ... --no-cache`.

Open question: whether to add the `codex:*` aliases immediately or keep all new IDs under `openai:*` for consistency with the existing Codex SDK provider.

Review feedback and red-team audit items addressed after the initial dogfood pass:
- `loadApiProvider` already merges suite-level and provider-level env into `providerOptions.env`; the registry now passes that merged env to `OpenAICodexAppServerProvider`.
- Narrowed `service_tier` to match the generated app-server schema from codex-cli 0.118.0: `fast` and `flex` only.
- Moved the `spawn` mock implementation into `beforeEach` to satisfy `test/AGENTS.md` mock isolation rules.
- Regenerated the schema into `/tmp/codex-app-server-schema-current.XVTCwL` during the audit and compared rare app-server fields against the implementation.

Audit findings:

- `model_reasoning_effort: none`.
- `personality` values: `none`, `friendly`, `pragmatic`.
- `base_instructions`, `developer_instructions`, and `collaboration_mode`.
- The schema describes `thread_pool_size` as unlimited even though the implementation defaults to 1.
- `include_raw_events` was false.
- Turns could start while `initialize` was still pending.
- Threads could exceed `thread_pool_size` and remain over capacity after active turns finished.
- Same-`thread_id` rows could overlap turns because the queue key returned early.
- The OpenAI docs `#codex-sdk` anchor.
- `thread_id` resumes skipped unsubscribe cleanup.
- `error` notifications with `willRetry: true` were treated as terminal.
- `thread_id` rows could unsubscribe while another row was queued for the same thread.

Fixes cover serializing same-`thread_id` turns even under deep tracing, preserving the OpenAI docs `#codex-sdk` heading, default-unsubscribing non-persistent resumed threads, honoring retryable app-server errors, and deferring resumed-thread unsubscribe until no other protected queued caller remains.

Latest focused verification after these fixes:
- `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`: pass, 35 provider tests.
- `npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share`: pass with `{"comments":[],"summary":"No actionable issues found in the current diff."}`.