Codex App Server Provider Notes

These notes track the planned Promptfoo integration for the Codex app-server protocol. They are intentionally implementation-facing: keep them current as the provider, docs, examples, and verification expand.

For the broader coding-agent provider taxonomy, see coding-agent-provider-taxonomy.md.

Objective

Add an experimental Promptfoo provider that drives codex app-server directly. The provider should complement, not replace, the existing OpenAI Codex SDK provider:

Codex SDK provider: best default for CI and automation.
Codex app-server provider: best for evaluating rich-client behavior exposed by the Codex app-server protocol, including streamed item events, approvals, skills, plugins, apps, filesystem requests, and thread lifecycle primitives.

Primary provider IDs:

openai:codex-app-server
openai:codex-app-server:<model>
openai:codex-desktop
openai:codex-desktop:<model>
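
The model-in-path form above can be illustrated with a small parser. This is only a sketch: `parseCodexProviderId` and `BASE_IDS` are hypothetical names, not the registry's actual code, and the model string is passed through without validation.

```typescript
// Hypothetical helper: split a provider ID into its base ID and optional model.
type ParsedCodexProviderId = { base: string; model?: string };

// The two OpenAI-scoped base IDs listed in these notes.
const BASE_IDS = ["openai:codex-app-server", "openai:codex-desktop"];

function parseCodexProviderId(id: string): ParsedCodexProviderId | undefined {
  for (const base of BASE_IDS) {
    if (id === base) {
      return { base };
    }
    if (id.startsWith(base + ":")) {
      // Everything after the base ID is treated as the model name.
      const model = id.slice(base.length + 1);
      return model.length > 0 ? { base, model } : { base };
    }
  }
  // Not a Codex app-server provider ID.
  return undefined;
}
```

A registry test could then check that `openai:codex-app-server:<model>` yields the model and that the bare ID leaves it undefined.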

Optional top-level aliases may be added after the OpenAI-scoped provider is stable:

codex:app-server
codex:desktop

Source Material

Official docs: https://developers.openai.com/codex/app-server
Local CLI: /Applications/Codex.app/Contents/Resources/codex app-server --help
Local generated schema command:

```bash
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json
```

The schema currently inspected locally was generated from codex-cli 0.118.0.

Protocol Shape

Transport:

stdio:// default, JSONL messages.
ws://IP:PORT experimental, one JSON-RPC message per WebSocket text frame.

Handshake:

Send initialize with Promptfoo client metadata.
Send initialized notification.
Start or resume a thread.
Start a turn.
Read notifications until turn/completed.
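
The handshake order above might be serialized as follows. This is a sketch under several assumptions: `JsonlWriter` is a hypothetical helper, the `jsonrpc` version field is left out on the assumption that the wire format omits it, and the `params` shapes (`clientInfo`, `cwd`, `threadId`, `input`) are illustrative placeholders that should be checked against the generated schema.

```typescript
// A JSON-RPC request carries an id; a notification does not.
type OutgoingMessage = { id?: number; method: string; params?: unknown };

// Hypothetical helper that numbers requests and emits one JSON message per line.
class JsonlWriter {
  private nextId = 1;

  request(method: string, params?: unknown): string {
    const message: OutgoingMessage = { id: this.nextId++, method, params };
    return JSON.stringify(message) + "\n";
  }

  notification(method: string, params?: unknown): string {
    const message: OutgoingMessage = { method, params };
    return JSON.stringify(message) + "\n";
  }
}

const writer = new JsonlWriter();
const handshakeFrames: string[] = [
  // All params below are placeholder shapes, not the confirmed schema.
  writer.request("initialize", { clientInfo: { name: "promptfoo", version: "0.0.0" } }),
  writer.notification("initialized"),
  writer.request("thread/start", { cwd: "/tmp/example-project" }),
  // A real client would take the thread id from the thread/start response.
  writer.request("turn/start", {
    threadId: "thread-placeholder",
    input: [{ type: "text", text: "Say hello" }],
  }),
];
```

After writing these frames, a client would read notifications until `turn/completed` arrives for the started turn.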

Core client requests:

initialize
thread/start
thread/resume
thread/archive
thread/unsubscribe
thread/read
turn/start
turn/steer
turn/interrupt
review/start
model/list
skills/list
plugin/list
plugin/read
app/list

High-risk client requests that should not be exposed casually:

fs/writeFile
fs/remove
fs/copy
config/value/write
config/batchWrite
plugin/install
plugin/uninstall
command/exec

Core server notifications:

thread/started
thread/status/changed
turn/started
turn/completed
item/started
item/completed
item/agentMessage/delta
item/commandExecution/outputDelta
item/fileChange/outputDelta
item/mcpToolCall/progress
serverRequest/resolved
thread/tokenUsage/updated
error

Core server requests requiring deterministic Promptfoo responses:

item/commandExecution/requestApproval
item/fileChange/requestApproval
item/permissions/requestApproval
item/tool/requestUserInput
mcpServer/elicitation/request
item/tool/call
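
One way to keep these server requests deterministic is a single dispatch function with decline-by-default answers. The sketch below is not the provider's implementation: `ServerRequestPolicy` and `answerServerRequest` are hypothetical, and the response field names (`decision`, `permissions`, `answers`, `action`, `success`, `error`) are placeholders rather than the generated schema.

```typescript
// Hypothetical policy shape; the real server_request_policy config may differ.
type ServerRequestPolicy = {
  commandExecution?: "accept" | "decline";
  fileChange?: "accept" | "decline";
  userInputAnswers?: Record<string, string>;
  dynamicTools?: Record<string, unknown>;
};

function answerServerRequest(
  method: string,
  params: Record<string, unknown>,
  policy: ServerRequestPolicy = {},
): unknown {
  switch (method) {
    case "item/commandExecution/requestApproval":
      return { decision: policy.commandExecution ?? "decline" };
    case "item/fileChange/requestApproval":
      return { decision: policy.fileChange ?? "decline" };
    case "item/permissions/requestApproval":
      // Empty grant by default.
      return { permissions: {} };
    case "item/tool/requestUserInput":
      return { answers: policy.userInputAnswers ?? {} };
    case "mcpServer/elicitation/request":
      return { action: "decline" };
    case "item/tool/call": {
      // Static configured response, looked up by an assumed "tool" field.
      const tool = params["tool"];
      const configured = typeof tool === "string" ? policy.dynamicTools?.[tool] : undefined;
      return configured ?? { success: false, error: "no static response configured" };
    }
    default:
      throw new Error(`Unhandled server request: ${method}`);
  }
}
```

Because every branch returns without waiting on input, an eval row cannot block on human approval.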

Provider Contract

Inputs

Promptfoo prompt strings remain the default. The provider should also accept a JSON array of Codex input items:

```json
[
  { "type": "text", "text": "Review this project" },
  { "type": "local_image", "path": "/absolute/path/to/screenshot.png" },
  { "type": "skill", "name": "skill-creator", "path": "/absolute/path/SKILL.md" }
]
```

Supported input item types for the first implementation:

text
local_image, mapped to app-server inputImage
skill

Unknown prompt JSON should be treated as plain text instead of throwing.
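
The fallback rule can be sketched as below. `normalizePrompt` and `isInputItem` are hypothetical names. The sketch assumes that an array containing any unrecognized item is treated as plain text in its entirety, and it does not show the mapping to app-server wire types.

```typescript
// The three input item types supported by the first implementation.
type CodexInputItem =
  | { type: "text"; text: string }
  | { type: "local_image"; path: string }
  | { type: "skill"; name: string; path: string };

function isInputItem(value: unknown): value is CodexInputItem {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const item = value as Record<string, unknown>;
  switch (item["type"]) {
    case "text":
      return typeof item["text"] === "string";
    case "local_image":
      return typeof item["path"] === "string";
    case "skill":
      return typeof item["name"] === "string" && typeof item["path"] === "string";
    default:
      return false;
  }
}

function normalizePrompt(prompt: string): CodexInputItem[] {
  const trimmed = prompt.trim();
  if (trimmed.startsWith("[")) {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (Array.isArray(parsed) && parsed.length > 0 && parsed.every(isInputItem)) {
        return parsed;
      }
    } catch {
      // Not valid JSON: fall through and treat the prompt as plain text.
    }
  }
  return [{ type: "text", text: prompt }];
}
```

Falling back rather than throwing keeps ordinary prompts that merely start with `[` usable.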

Output

The provider response should include:

output: final assistant text, assembled from item/agentMessage/delta and completed agentMessage items.
sessionId: thread id.
raw: serialized protocol-level turn summary and selected notifications.
metadata.codexAppServer: thread id, turn id, model, cwd, sandbox, approvals, server requests, item counts, command/file/tool trajectories, and app-server protocol data useful for debugging.
metadata.skillCalls / metadata.attemptedSkillCalls: heuristic skill usage where available.
tokenUsage: from thread/tokenUsage/updated if emitted.
cost: estimated from Promptfoo's Codex pricing table when model is known.
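
As an illustration of how `output` might be assembled, the sketch below accumulates deltas and prefers the last completed agentMessage when one exists. `AssistantTextAggregator` is hypothetical, and the notification field names (`delta`, `item.type`, `item.text`) are assumptions about the payload shape.

```typescript
type ServerNotification = { method: string; params: Record<string, unknown> };

// Hypothetical aggregator for the final assistant text of one turn.
class AssistantTextAggregator {
  private deltas = "";
  private lastCompleted: string | undefined;

  handle(notification: ServerNotification): void {
    const { method, params } = notification;
    if (method === "item/agentMessage/delta") {
      const delta = params["delta"];
      if (typeof delta === "string") {
        this.deltas += delta;
      }
    } else if (method === "item/completed") {
      const item = params["item"];
      if (typeof item === "object" && item !== null) {
        const { type, text } = item as { type?: unknown; text?: unknown };
        if (type === "agentMessage" && typeof text === "string") {
          // Later completed messages replace earlier ones.
          this.lastCompleted = text;
        }
      }
    }
  }

  finalText(): string {
    // Fall back to streamed deltas when no completed message arrived.
    return this.lastCompleted ?? this.deltas;
  }
}
```

Preferring the last completed message avoids concatenating progress messages with the final answer.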

Config

Provider-level config should be strict. Prompt-level merged config should strip unknown keys so that generic Promptfoo prompt config does not cause eval rows to fail validation.

Core config:

apiKey
base_url
working_dir
additional_directories
skip_git_repo_check
codex_path_override
model
model_provider
service_tier
sandbox_mode
sandbox_policy
approval_policy
approvals_reviewer
model_reasoning_effort
reasoning_summary
personality
output_schema
thread_id
persist_threads
thread_pool_size
ephemeral
persist_extended_history
experimental_raw_events
experimental_api
cli_config
cli_env
inherit_process_env
reuse_server
deep_tracing
request_timeout_ms
startup_timeout_ms
server_request_policy
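
The difference between strict provider-level validation and lenient prompt-level stripping could look like the sketch below. The helper names are hypothetical, `KNOWN_KEYS` holds only a subset of the keys above for brevity, and value validation is omitted.

```typescript
type LooseConfig = Record<string, unknown>;

// Subset of the core config keys, for illustration only.
const KNOWN_KEYS = new Set<string>([
  "model",
  "working_dir",
  "sandbox_mode",
  "approval_policy",
  "ephemeral",
  "reuse_server",
  "request_timeout_ms",
]);

// Provider-level: reject unknown keys so typos surface early.
function parseProviderConfig(config: LooseConfig): LooseConfig {
  const unknownKeys = Object.keys(config).filter((key) => !KNOWN_KEYS.has(key));
  if (unknownKeys.length > 0) {
    throw new Error(`Unknown Codex app-server config keys: ${unknownKeys.join(", ")}`);
  }
  return { ...config };
}

// Prompt-level: silently drop keys that belong to other providers.
function stripPromptConfig(config: LooseConfig): LooseConfig {
  const result: LooseConfig = {};
  for (const [key, value] of Object.entries(config)) {
    if (KNOWN_KEYS.has(key)) {
      result[key] = value;
    }
  }
  return result;
}
```

With this split, a prompt config carrying a generic key such as `temperature` is stripped instead of failing the row, while the same key at provider level raises an error.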

Safety Defaults

Default stance should favor repeatable evals over convenience:

approval_policy: never
sandbox_mode: read-only
network_access_enabled: false
ephemeral: true
reuse_server: true unless deep_tracing is enabled
inherit_process_env: false
Server-side approval requests: decline/cancel or empty grants unless explicitly configured.

Rationale:

The app-server exposes shell, filesystem, app connector, plugin, and config surfaces.
Promptfoo evals should be deterministic and should not block on human approval.
Eval prompts and target behavior can be adversarial.
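
The defaults above can be expressed as a merge in which explicit user config wins, except that deep tracing always disables server reuse. This is a sketch: `resolveSafetyConfig` is a hypothetical name, and a `deep_tracing` default of `false` is an assumption.

```typescript
type SafetyConfig = {
  approval_policy: string;
  sandbox_mode: string;
  network_access_enabled: boolean;
  ephemeral: boolean;
  inherit_process_env: boolean;
  reuse_server: boolean;
  deep_tracing: boolean;
};

function resolveSafetyConfig(user: Partial<SafetyConfig> = {}): SafetyConfig {
  // Assumption: deep tracing is off unless explicitly enabled.
  const deepTracing = user.deep_tracing ?? false;
  return {
    approval_policy: "never",
    sandbox_mode: "read-only",
    network_access_enabled: false,
    ephemeral: true,
    inherit_process_env: false,
    ...user,
    deep_tracing: deepTracing,
    // reuse_server defaults to true, but deep tracing forces it off.
    reuse_server: deepTracing ? false : (user.reuse_server ?? true),
  };
}
```

For example, `resolveSafetyConfig({ deep_tracing: true, reuse_server: true })` still resolves `reuse_server` to `false`.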

Implementation Phases

Stdio JSON-RPC client
- Spawn codex app-server.
- Parse JSONL stdout.
- Route responses, notifications, and server requests.
- Capture stderr for debug logs.
- Support abort and timeout.
Provider lifecycle
- Register with providerRegistry.
- Reuse app-server process by default.
- Shut down child processes, pending requests, and readline handles.
- Disable reuse when deep_tracing is enabled.
Thread and turn execution
- Validate working directory.
- Start/resume threads.
- Start turns with prompt input, model, cwd, sandbox, approvals, effort, personality, service tier, and output schema.
- Serialize turns per reused thread.
- Unsubscribe/archive non-persistent threads during cleanup.
Streaming aggregation
- Track turnId.
- Assemble assistant deltas.
- Store completed items.
- Build item counts and trajectory metadata.
- Capture command output, file changes, MCP calls, dynamic tool calls, web search, plans, reasoning summaries, and review output.
Server request handling
- Deterministically answer command approvals.
- Deterministically answer file-change approvals.
- Return empty permission grants by default.
- Support configured answers for tool/requestUserInput.
- Decline/cancel MCP elicitation by default.
- Support static dynamic-tool responses.
- Record all requests and decisions in metadata.
Tracing
- Wrap callApi in withGenAISpan.
- Add item-level spans when streaming notifications arrive.
- Sanitize command output, tool arguments, and message text before trace attributes.
- Inject OTEL env when deep_tracing is enabled.
Docs and examples
- Add provider docs.
- Add provider index entry.
- Add examples for basic usage, read-only repo review, structured output, approval handling, skills, and tracing.
Verification
- Unit tests with mocked child process and mocked protocol frames.
- Registry tests for provider IDs and model-in-path parsing.
- Docs/examples lint where applicable.
- Local smoke config using a harmless prompt and sandbox_mode: read-only if credentials/login are available.
- Final dogfood: run the new provider against the git diff and iterate on comments.
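
For the stdio JSON-RPC client phase, parsing and routing can be sketched as a line buffer plus a classifier. `JsonlBuffer` and `classifyMessage` are hypothetical names. The classifier relies only on general JSON-RPC structure: a method with an id is a request, a method without an id is a notification, and an id without a method is a response. This minimal version does not repair literal newlines inside strings.

```typescript
type JsonRpcMessage = {
  id?: number | string;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: unknown;
};

type MessageKind = "response" | "notification" | "serverRequest";

function classifyMessage(message: JsonRpcMessage): MessageKind {
  const hasId = message.id !== undefined;
  if (message.method !== undefined) {
    return hasId ? "serverRequest" : "notification";
  }
  if (hasId) {
    return "response";
  }
  throw new Error("Malformed JSON-RPC message: no id and no method");
}

// Buffers stdout chunks and yields only complete, newline-terminated messages.
class JsonlBuffer {
  private pending = "";

  push(chunk: string): JsonRpcMessage[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line (or "" after a trailing newline).
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as JsonRpcMessage);
  }
}
```

A client would feed each stdout chunk to `push` and dispatch every returned message according to `classifyMessage`.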

Progress Log

2026-04-09

Added initial provider implementation at src/providers/openai/codex-app-server.ts.
Added provider IDs under the OpenAI registry:
- openai:codex-app-server
- openai:codex-app-server:<model>
- openai:codex-desktop
- openai:codex-desktop:<model>
Implemented stdio JSON-RPC lifecycle:
- spawn codex app-server --listen stdio://
- initialize
- initialized
- thread/start
- thread/resume
- turn/start
- notification handling through turn/completed
- thread/unsubscribe and thread/archive cleanup modes
Implemented safe config defaults:
- approval_policy: never
- sandbox_mode: read-only
- ephemeral: true
- thread_cleanup: unsubscribe
- process env isolation unless inherit_process_env: true
Implemented deterministic server request responses:
- command execution approvals default to decline
- file changes default to decline
- permission requests default to empty grants
- user input requests default to empty answers
- MCP elicitations default to decline
- dynamic tools can use static configured responses
Implemented output normalization:
- assistant delta aggregation
- completed agentMessage fallback/preference
- token usage from thread/tokenUsage/updated
- cost estimate for known Codex models
- metadata with item counts, items, server request decisions, thread/turn ids
Implemented provider-level GenAI tracing and item spans.
Added mocked protocol tests in test/providers/openai-codex-app-server.test.ts.
Added registry tests in test/providers/index.test.ts.
Verification so far:
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false
- npm run tsc -- --pretty false
Expanded mocked protocol tests to cover:
- thread/resume
- structured prompt input normalization
- default thread/unsubscribe cleanup
- user input request policy
- dynamic tool static response policy
- metadata sanitization
Ran a real local smoke eval through npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache -o /tmp/promptfoo-codex-app-server-example.json.
- Result: pass.
- Provider returned Codex app-server sessionId, token usage, item counts, thread id, turn id, and structured JSON output.
Ran docs build:
- cd site && SKIP_OG_GENERATION=true npm run build
- Result: pass.
First dogfood review through examples/openai-codex-app-server/review-diff/promptfooconfig.yaml found four actionable provider issues:
- startup timeout could leak a spawned app-server and leave a rejected reusable connection promise cached
- reused connections closed over the first turn's server request policy
- app-server exit during a turn could leave the eval waiting forever when no turn timeout was configured
- raw response payload serialized unsanitized protocol items
Fixed the dogfood findings and added regression coverage:
- failed startup closes the process and a later call spawns a fresh process
- active turns store their effective config so prompt-level server request policies are honored on reused servers
- connection exit resolves active turns with a provider error and removes the dead connection from reuse maps
- raw now contains sanitized thread, turn, token usage, notifications, and item metadata
- final output now uses the last completed agentMessage, which avoids concatenating progress messages with final structured review output
Verification after fixes:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 13 provider tests.
Second dogfood review passed the Promptfoo eval and returned valid JSON, but still reported two provider comments:
- legacy execCommandApproval / applyPatchApproval requests identify the active thread with conversationId and expect legacy review decisions
- persistent thread-pool eviction deleted local handles without unsubscribing the evicted loaded thread
Additional hardening from the second dogfood pass:
- stdio parser now buffers partial JSON-RPC lines and rejoins literal newlines inside command-output strings as escaped newlines before parsing
- legacy approval requests now map prompt-level policy to approved, approved_for_session, denied, and abort
- legacy server requests can find active turn state by conversationId
- evicted persistent cached threads now send thread/unsubscribe before being removed
- added regression coverage for literal-newline JSON-RPC notifications, legacy approval requests, and persistent thread-pool eviction
Verification after second dogfood fixes:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 16 provider tests.
Third dogfood review passed transport/eval and reported two lifecycle comments:
- stale persistent thread handles remained after a reused app-server process exited
- JSON-RPC request timeout cleanup removed the pending request but left an abort listener attached
Additional hardening from the third dogfood pass:
- connection close now removes cached thread handles owned by that connection key
- per-request timeout cleanup now removes abort listeners before rejecting
- added regression coverage for cached-thread invalidation after process exit and abort-listener cleanup on JSON-RPC timeout
Verification after third dogfood fixes:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 18 provider tests.
Fourth dogfood review passed transport/eval and reported two thread-cache comments:
- persistent thread caching was still enabled for fresh-per-call app-server processes (reuse_server: false or deep_tracing)
- pool eviction could unsubscribe an active cached thread before its turn completed
Additional hardening from the fourth dogfood pass:
- thread caching is now allowed only when the app-server connection itself is reusable
- active/reserved thread ids are protected with a small refcount while a call is using them
- thread-pool eviction skips protected threads and temporarily allows the pool to exceed its soft cap rather than evicting an in-flight turn
- added regression coverage for non-reusable persistent-thread configs and active-turn eviction avoidance
Verification after fourth dogfood fixes:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 20 provider tests.
Fifth dogfood review passed transport/eval and reported one persistent-thread race:
- concurrent calls with the same persistent-thread cache key could both miss the cache while the first thread/start was still pending, creating duplicate persistent threads and leaking the earlier one
Additional hardening from the fifth dogfood pass:
- added an in-flight thread promise map keyed by thread cache key
- concurrent same-cache thread/start / thread/resume callers now share the same pending thread handle
- added regression coverage for concurrent same-cache persistent calls, ensuring only one thread/start is sent and both turns use the shared thread
Verification after fifth dogfood fix:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 21 provider tests.
Sixth dogfood review passed transport/eval and reported two cache/default comments:
- reusable connections could keep the first request timeout for requests that did not pass a per-call timeout
- persistent thread cache keys omitted thread-start options such as ephemeral, experimental_raw_events, and persist_extended_history
Additional hardening from the sixth dogfood pass:
- all provider-owned app-server requests now pass the effective per-call request timeout explicitly
- persistent thread cache keys now include thread-start options that can change thread semantics
- added regression coverage for prompt-level request timeouts on reused connections and thread-start option changes in persistent cache keys
Verification after sixth dogfood fixes:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 23 provider tests.
Seventh dogfood retry:
- first attempt hit an external Codex connectivity failure (Network is unreachable, Reconnecting... 2/5), which the provider surfaced as a clean provider error
- retry completed transport/eval and reported one metadata issue: skill-root detection used the parent process env instead of the resolved app-server child env
Additional hardening from the seventh dogfood pass:
- turn state now carries the resolved app-server environment produced by prepareEnvironment
- skill root detection now uses the child env for CODEX_HOME, HOME, and USERPROFILE
- added regression coverage for cli_env.HOME skill-call metadata detection
Verification after seventh dogfood fix:
- npm run tsc -- --pretty false
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
- Result: pass, 24 provider tests.
Eighth dogfood review:
- npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-review.json
- Result: pass.
- Provider output: {"comments":[],"summary":"No actionable findings; TypeScript and focused provider tests passed."}
- This confirms the provider can be used to review the current git diff and return schema-valid JSON with no remaining actionable comments from the dogfood reviewer.
Final verification sweep:
- npm run f: pass with existing complexity warnings only; no formatting changes needed
- npm run tsc -- --pretty false: pass
- npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false: pass, 24 provider tests
- npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false: pass, 2 registry tests
- npm run l: pass with existing complexity warnings only
- cd site && SKIP_OG_GENERATION=true npm run build: pass
- npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-example.json: pass
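
The refcount-protected eviction described in the fourth dogfood pass can be sketched as below. `ThreadPool` is a hypothetical class, not the provider's code. It records evicted ids in an array where a real provider would send thread/unsubscribe, and it assumes oldest-first eviction.

```typescript
class ThreadPool {
  private threads: string[] = []; // oldest first
  private refs = new Map<string, number>();
  private softCap: number;
  // Stand-in for sending thread/unsubscribe on eviction.
  readonly evicted: string[] = [];

  constructor(softCap: number) {
    this.softCap = softCap;
  }

  // Mark a thread as in use by one more caller.
  acquire(threadId: string): void {
    if (!this.threads.includes(threadId)) {
      this.threads.push(threadId);
    }
    this.refs.set(threadId, (this.refs.get(threadId) ?? 0) + 1);
    this.rebalance();
  }

  // Drop one caller's protection and rebalance once the turn is finished.
  release(threadId: string): void {
    const count = (this.refs.get(threadId) ?? 0) - 1;
    if (count > 0) {
      this.refs.set(threadId, count);
    } else {
      this.refs.delete(threadId);
    }
    this.rebalance();
  }

  size(): number {
    return this.threads.length;
  }

  private rebalance(): void {
    while (this.threads.length > this.softCap) {
      const victim = this.threads.find((id) => !this.refs.has(id));
      if (victim === undefined) {
        // Every thread is protected: exceed the soft cap temporarily.
        return;
      }
      this.threads = this.threads.filter((id) => id !== victim);
      this.evicted.push(victim);
    }
  }
}
```

With a cap of 1 and two in-flight threads, the pool holds both until one is released, then evicts the oldest unprotected thread.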

QA Matrix

Required mocked unit tests:

Constructor defaults and strict config validation.
Prompt-level unknown config stripping.
Missing/inaccessible/non-directory working directory.
Git check and skip_git_repo_check.
API key/env isolation and explicit cli_env.
Handshake order: initialize, initialized, thread/start, turn/start.
Model from provider path overrides/defaults correctly.
thread_id uses thread/resume.
persist_threads reuses cached thread and serializes turns.
Non-persistent calls unsubscribe/archive as configured.
Assistant deltas aggregate into final output.
Completed agentMessage fallback works when deltas are missing.
Token usage from thread/tokenUsage/updated.
Error notification produces provider error.
Failed turn produces provider error.
Abort before start.
Abort during turn sends turn/interrupt and returns aborted error.
Command approval request default decline.
File change request default decline.
Permission request default empty grant.
User input request configured answers.
Dynamic tool call configured static response.
MCP elicitation default decline/cancel.
Metadata contains item counts, trajectories, approvals, raw notifications, and server request decisions without leaking API keys.
cleanup kills child process and unregisters provider.
deep_tracing injects OTEL env and disables reuse/thread persistence.
Provider-level GenAI tracing records response body, token usage, session id, and item count attributes.

Required docs/examples checks:

Provider docs render in Docusaurus.
Examples are listed and runnable from repo root with npm run local -- eval ... --no-cache.
Config docs call out experimental status, safety defaults, and difference from Codex SDK.

Open Questions

Whether to expose WebSocket transport in the first public version. Stdio is enough for Promptfoo-managed app-server processes; WebSocket is useful for external clients but adds auth and lifecycle complexity.
Whether to support top-level codex:* aliases immediately or keep all new IDs under openai:* for consistency with the existing Codex SDK provider.
Whether to persist generated app-server protocol types in source. The current plan is to implement a narrow local type surface and document how to regenerate schemas instead of committing a large generated bundle.

Critical Audit Follow-up

Review feedback and red-team audit items addressed after the initial dogfood pass:

Fixed registry env propagation for object-shaped provider configs. loadApiProvider already merges suite-level and provider-level env into providerOptions.env; the registry now passes that merged env to OpenAICodexAppServerProvider.
Fixed service_tier to match the generated app-server schema from codex-cli 0.118.0: fast and flex only.
Reset the hoisted spawn mock implementation in beforeEach to satisfy test/AGENTS.md mock isolation rules.
Regenerated app-server TypeScript and JSON Schema into /tmp/codex-app-server-schema-current.XVTCwL during the audit and compared rare app-server fields against the implementation.
Added coverage for schema-supported rare fields:
- model_reasoning_effort: none
- exact personality values: none, friendly, pragmatic
- granular approval policy objects
- app-server command approval amendment objects
- session-scoped permission grants
- accepted MCP elicitation responses with content and metadata
- base_instructions, developer_instructions, and collaboration_mode
Dogfood review then found additional issues:
- reusable app-server connections stayed alive after JSON-RPC request timeouts, which could leave late side-effecting responses unmanaged
- docs listed thread_pool_size as unlimited even though the implementation defaults to 1
- provider cleanup cleared active turns before resolving them, which could hang shutdown while a turn was in flight
- raw JSON-RPC notifications were retained even when include_raw_events was false
- spawned app-server processes could be missed if cleanup ran while initialize was still pending
- concurrent persistent thread starts could temporarily exceed thread_pool_size and remain over capacity after active turns finished
- deep-tracing calls that shared a thread_id could overlap turns because the queue key returned early
- the OpenAI provider docs heading change would have broken the existing #codex-sdk anchor
- default thread_id resumes skipped unsubscribe cleanup
- sent JSON-RPC request aborts kept the reusable app-server alive
- retryable app-server error notifications with willRetry: true were treated as terminal
- concurrent default-cleanup thread_id rows could unsubscribe while another row was queued for the same thread
Fixed these by:
- closing/evicting connections on timeout and abort
- resolving active turns during cleanup
- tracking pending initialization processes
- making raw event retention opt-in
- rebalancing the persistent thread pool after turns finish
- serializing explicit thread_id turns even under deep tracing
- preserving the OpenAI docs #codex-sdk heading
- default-unsubscribing non-persistent resumed threads
- honoring retryable app-server errors
- deferring resumed-thread unsubscribe until no other protected queued caller remains
Updated docs to explain why the app-server provider should stay separate from the Codex SDK provider: the SDK is the right default for CI and automation, while app-server is for rich-client protocol event surfaces and does not attach to a running Codex Desktop app.

Latest focused verification after these fixes:

npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false: pass, 35 provider tests.
npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share: pass with {"comments":[],"summary":"No actionable issues found in the current diff."}.