Codex App Server Provider Notes

docs/agents/codex-app-server-provider-notes.md

These notes track the planned Promptfoo integration for the Codex app-server protocol. They are intentionally implementation-facing: keep them current as the provider, docs, examples, and verification expand.

For the broader coding-agent provider taxonomy, see coding-agent-provider-taxonomy.md.

Objective

Add an experimental Promptfoo provider that drives codex app-server directly. The provider should complement, not replace, the existing OpenAI Codex SDK provider:

  • Codex SDK provider: best default for CI and automation.
  • Codex app-server provider: best for evaluating rich-client behavior exposed by the Codex app-server protocol, including streamed item events, approvals, skills, plugins, apps, filesystem requests, and thread lifecycle primitives.

Primary provider IDs:

  • openai:codex-app-server
  • openai:codex-app-server:<model>
  • openai:codex-desktop
  • openai:codex-desktop:<model>
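The model-in-path form of these IDs could be parsed along the following lines. This is a minimal sketch, not the registry's actual code: `parseCodexProviderId` and `ParsedProviderId` are hypothetical names, and rejoining the remainder on `:` is an assumption about how model names containing colons should be treated.

```typescript
// Hypothetical helper: splits a provider ID into its variant and optional model.
type CodexVariant = "codex-app-server" | "codex-desktop";

interface ParsedProviderId {
  variant: CodexVariant;
  model?: string;
}

function parseCodexProviderId(id: string): ParsedProviderId | undefined {
  const [scope, variant, ...rest] = id.split(":");
  if (scope !== "openai") return undefined;
  if (variant !== "codex-app-server" && variant !== "codex-desktop") return undefined;
  // Assumption: rejoin the remainder so a model name containing ":" survives intact.
  const model = rest.join(":");
  return model ? { variant, model } : { variant };
}
```

An ID with no model segment yields only the variant, leaving the model to come from config or provider defaults.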

Optional top-level aliases may be added after the OpenAI-scoped provider is stable:

  • codex:app-server
  • codex:desktop

Source Material

```bash
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json
```

The schema inspected locally for these notes was generated from codex-cli 0.118.0.

Protocol Shape

Transport:

  • stdio:// (default): newline-delimited JSON (JSONL), one message per line.
  • ws://IP:PORT (experimental): one JSON-RPC message per WebSocket text frame.
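For the stdio transport, stdout chunks do not necessarily align with message boundaries, so a client has to buffer partial lines. The sketch below shows one way to do that. `JsonlDecoder` is a hypothetical name, and it assumes each message occupies exactly one line; it does not attempt to repair literal newlines inside string values.

```typescript
// Hypothetical line-buffering decoder for JSONL read from a child's stdout.
class JsonlDecoder {
  private buffer = "";

  // Feed a raw chunk; returns every message completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is "" (chunk ended on a newline) or an unfinished line.
    this.buffer = lines.pop() ?? "";
    const messages: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      // JSON.parse throws on malformed lines; a real client would catch and log.
      if (trimmed) messages.push(JSON.parse(trimmed));
    }
    return messages;
  }
}
```

Keeping the decoder free of process handles makes it easy to unit test with hand-written chunks.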

Handshake:

  1. Send initialize with Promptfoo client metadata.
  2. Send initialized notification.
  3. Start or resume a thread.
  4. Start a turn.
  5. Read notifications until turn/completed.
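The five steps above can be sketched against an abstract transport. Everything about the payloads here is an assumption made for illustration: the `Transport` interface, the `clientInfo` fields, the `{ thread: { id } }` result shape, and the `threadId`/`input` parameters should all be checked against the generated schema.

```typescript
// Hypothetical transport abstraction; a real one would wrap the child process.
interface Transport {
  request(method: string, params?: unknown): Promise<unknown>;
  notify(method: string, params?: unknown): void;
  nextNotification(): Promise<{ method: string; params?: unknown }>;
}

async function runSingleTurn(
  transport: Transport,
  prompt: string,
  resumeThreadId?: string,
): Promise<string[]> {
  // Steps 1 and 2: initialize, then the initialized notification.
  await transport.request("initialize", {
    clientInfo: { name: "promptfoo", version: "0.0.0" }, // assumed shape
  });
  transport.notify("initialized");

  // Step 3: start a new thread or resume an existing one.
  const threadResult = resumeThreadId
    ? await transport.request("thread/resume", { threadId: resumeThreadId })
    : await transport.request("thread/start", {});
  const threadId = (threadResult as { thread: { id: string } }).thread.id; // assumed shape

  // Step 4: start the turn.
  await transport.request("turn/start", {
    threadId,
    input: [{ type: "text", text: prompt }], // assumed shape
  });

  // Step 5: read notifications until turn/completed.
  const seen: string[] = [];
  for (;;) {
    const notification = await transport.nextNotification();
    seen.push(notification.method);
    if (notification.method === "turn/completed") return seen;
  }
}
```

This sketch has no timeout or abort handling, so a server that exits mid-turn would leave the loop waiting.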

Core client requests:

  • initialize
  • thread/start
  • thread/resume
  • thread/archive
  • thread/unsubscribe
  • thread/read
  • turn/start
  • turn/steer
  • turn/interrupt
  • review/start
  • model/list
  • skills/list
  • plugin/list
  • plugin/read
  • app/list

High-risk client requests that should not be exposed casually:

  • fs/writeFile
  • fs/remove
  • fs/copy
  • config/value/write
  • config/batchWrite
  • plugin/install
  • plugin/uninstall
  • command/exec

Core server notifications:

  • thread/started
  • thread/status/changed
  • turn/started
  • turn/completed
  • item/started
  • item/completed
  • item/agentMessage/delta
  • item/commandExecution/outputDelta
  • item/fileChange/outputDelta
  • item/mcpToolCall/progress
  • serverRequest/resolved
  • thread/tokenUsage/updated
  • error

Core server requests requiring deterministic Promptfoo responses:

  • item/commandExecution/requestApproval
  • item/fileChange/requestApproval
  • item/permissions/requestApproval
  • item/tool/requestUserInput
  • mcpServer/elicitation/request
  • item/tool/call

Provider Contract

Inputs

Promptfoo prompt strings remain the default. The provider should also accept a JSON array of Codex input items:

```json
[
  { "type": "text", "text": "Review this project" },
  { "type": "local_image", "path": "/absolute/path/to/screenshot.png" },
  { "type": "skill", "name": "skill-creator", "path": "/absolute/path/SKILL.md" }
]
```

Supported input item types for the first implementation:

  • text
  • local_image, mapped to app-server inputImage
  • skill

Unknown prompt JSON should be treated as plain text instead of throwing.
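The fallback rule could be implemented roughly as follows. This is a sketch under stated assumptions: `normalizePromptInput` and `isInputItem` are hypothetical names, and treating an array with any unrecognized item as plain text is one interpretation of "unknown prompt JSON". The mapping from these items to the app-server wire format is deliberately left out.

```typescript
type InputItem =
  | { type: "text"; text: string }
  | { type: "local_image"; path: string }
  | { type: "skill"; name: string; path: string };

function isInputItem(value: unknown): value is InputItem {
  if (typeof value !== "object" || value === null) return false;
  const item = value as Record<string, unknown>;
  switch (item.type) {
    case "text":
      return typeof item.text === "string";
    case "local_image":
      return typeof item.path === "string";
    case "skill":
      return typeof item.name === "string" && typeof item.path === "string";
    default:
      return false;
  }
}

function normalizePromptInput(prompt: string): InputItem[] {
  const asText: InputItem[] = [{ type: "text", text: prompt }];
  // Cheap pre-check so ordinary prose prompts never reach JSON.parse.
  if (!prompt.trim().startsWith("[")) return asText;
  try {
    const parsed: unknown = JSON.parse(prompt);
    if (Array.isArray(parsed) && parsed.length > 0 && parsed.every(isInputItem)) {
      return parsed;
    }
  } catch {
    // Not valid JSON: fall through to the plain-text fallback.
  }
  return asText;
}
```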

Output

The provider response should include:

  • output: final assistant text, assembled from item/agentMessage/delta and completed agentMessage items.
  • sessionId: thread id.
  • raw: serialized protocol-level turn summary and selected notifications.
  • metadata.codexAppServer: thread id, turn id, model, cwd, sandbox, approvals, server requests, item counts, command/file/tool trajectories, and app-server protocol data useful for debugging.
  • metadata.skillCalls / metadata.attemptedSkillCalls: heuristic skill usage where available.
  • tokenUsage: from thread/tokenUsage/updated if emitted.
  • cost: estimated from Promptfoo's Codex pricing table when model is known.
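One way to derive `output` from the notification stream is to prefer the last completed agentMessage and fall back to concatenated deltas when no completed item arrives. The sketch below assumes notification field names (`params.delta`, `params.item.type`, `params.item.text`) that should be verified against the generated schema; `assembleOutput` is a hypothetical name.

```typescript
interface ProtocolNotification {
  method: string;
  params: Record<string, unknown>;
}

function assembleOutput(notifications: ProtocolNotification[]): string {
  let deltas = "";
  let lastCompleted: string | undefined;
  for (const n of notifications) {
    if (n.method === "item/agentMessage/delta" && typeof n.params.delta === "string") {
      deltas += n.params.delta;
    } else if (n.method === "item/completed") {
      // Assumed shape: completed items carry their type and full text.
      const item = n.params.item as { type?: unknown; text?: unknown } | undefined;
      if (item?.type === "agentMessage" && typeof item.text === "string") {
        lastCompleted = item.text;
      }
    }
  }
  // The last completed message wins; deltas are only a fallback.
  return lastCompleted ?? deltas;
}
```

Preferring the final completed message keeps intermediate progress messages out of structured output.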

Config

Provider-level config should be strict. Prompt-level merged config should strip unknown keys so generic Promptfoo prompt config does not break rows.

Core config:

  • apiKey
  • base_url
  • working_dir
  • additional_directories
  • skip_git_repo_check
  • codex_path_override
  • model
  • model_provider
  • service_tier
  • sandbox_mode
  • sandbox_policy
  • approval_policy
  • approvals_reviewer
  • model_reasoning_effort
  • reasoning_summary
  • personality
  • output_schema
  • thread_id
  • persist_threads
  • thread_pool_size
  • ephemeral
  • persist_extended_history
  • experimental_raw_events
  • experimental_api
  • cli_config
  • cli_env
  • inherit_process_env
  • reuse_server
  • deep_tracing
  • request_timeout_ms
  • startup_timeout_ms
  • server_request_policy

Safety Defaults

Default stance should favor repeatable evals over convenience:

  • approval_policy: never
  • sandbox_mode: read-only
  • network_access_enabled: false
  • ephemeral: true
  • reuse_server: true unless deep_tracing is enabled
  • inherit_process_env: false
  • Server-side approval requests: decline/cancel or empty grants unless explicitly configured.

Rationale:

  • The app-server exposes shell, filesystem, app connector, plugin, and config surfaces.
  • Promptfoo evals should be deterministic and should not block on human approval.
  • Eval prompts and target behavior can be adversarial.
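The defaults above, including the interaction between deep_tracing and reuse_server, could be resolved as in this sketch. `resolveSafetyConfig` and `SafetyConfig` are hypothetical names, and typing the policy fields as plain strings is a simplification.

```typescript
interface SafetyConfig {
  approval_policy: string;
  sandbox_mode: string;
  network_access_enabled: boolean;
  ephemeral: boolean;
  reuse_server: boolean;
  inherit_process_env: boolean;
  deep_tracing: boolean;
}

const SAFE_DEFAULTS: SafetyConfig = {
  approval_policy: "never",
  sandbox_mode: "read-only",
  network_access_enabled: false,
  ephemeral: true,
  reuse_server: true,
  inherit_process_env: false,
  deep_tracing: false,
};

function resolveSafetyConfig(user: Partial<SafetyConfig> = {}): SafetyConfig {
  const merged: SafetyConfig = { ...SAFE_DEFAULTS, ...user };
  // deep_tracing needs a fresh server per call, so it overrides reuse_server.
  if (merged.deep_tracing) merged.reuse_server = false;
  return merged;
}
```

Anything less restrictive has to be opted into explicitly, which keeps adversarial eval prompts from reaching shell or filesystem surfaces by default.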

Implementation Phases

  1. Stdio JSON-RPC client
    • Spawn codex app-server.
    • Parse JSONL stdout.
    • Route responses, notifications, and server requests.
    • Capture stderr for debug logs.
    • Support abort and timeout.
  2. Provider lifecycle

    • Register with providerRegistry.
    • Reuse app-server process by default.
    • Shut down child processes, pending requests, and readline handles.
    • Disable reuse when deep_tracing is enabled.
  3. Thread and turn execution

    • Validate working directory.
    • Start/resume threads.
    • Start turns with prompt input, model, cwd, sandbox, approvals, effort, personality, service tier, and output schema.
    • Serialize turns per reused thread.
    • Unsubscribe/archive non-persistent threads during cleanup.
  4. Streaming aggregation

    • Track turnId.
    • Assemble assistant deltas.
    • Store completed items.
    • Build item counts and trajectory metadata.
    • Capture command output, file changes, MCP calls, dynamic tool calls, web search, plans, reasoning summaries, and review output.
  5. Server request handling

    • Deterministically answer command approvals.
    • Deterministically answer file-change approvals.
    • Return empty permission grants by default.
    • Support configured answers for tool/requestUserInput.
    • Decline/cancel MCP elicitation by default.
    • Support static dynamic-tool responses.
    • Record all requests and decisions in metadata.
  6. Tracing

    • Wrap callApi in withGenAISpan.
    • Add item-level spans when streaming notifications arrive.
    • Sanitize command output, tool arguments, and message text before trace attributes.
    • Inject OTEL env when deep_tracing is enabled.
  7. Docs and examples

    • Add provider docs.
    • Add provider index entry.
    • Add examples for basic usage, read-only repo review, structured output, approval handling, skills, and tracing.
  8. Verification

    • Unit tests with mocked child process and mocked protocol frames.
    • Registry tests for provider IDs and model-in-path parsing.
    • Docs/examples lint where applicable.
    • Local smoke config using a harmless prompt and sandbox_mode: read-only if credentials/login are available.
    • Final dogfood: run the new provider against the git diff and iterate on comments.
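Phase 5 (server request handling) comes down to a deterministic mapping from request method to decision. The sketch below returns abstract decisions and not wire payloads, because the exact response shapes depend on the generated schema. `decideServerRequest`, `Decision`, and the `ServerRequestPolicy` fields are hypothetical, and declining a dynamic tool call with no configured response is an assumption.

```typescript
type Decision =
  | { action: "decline" }
  | { action: "grant"; permissions: string[] }
  | { action: "answer"; answers: Record<string, string> }
  | { action: "toolResult"; output: string };

interface ServerRequestPolicy {
  userInputAnswers?: Record<string, string>;
  toolResponses?: Record<string, string>;
}

function decideServerRequest(
  method: string,
  policy: ServerRequestPolicy = {},
  toolName?: string,
): Decision {
  switch (method) {
    case "item/commandExecution/requestApproval":
    case "item/fileChange/requestApproval":
    case "mcpServer/elicitation/request":
      return { action: "decline" };
    case "item/permissions/requestApproval":
      // An empty grant answers the request without widening permissions.
      return { action: "grant", permissions: [] };
    case "item/tool/requestUserInput":
      return { action: "answer", answers: policy.userInputAnswers ?? {} };
    case "item/tool/call": {
      const output = toolName !== undefined ? policy.toolResponses?.[toolName] : undefined;
      // Assumption: a tool with no configured static response is declined.
      return output !== undefined ? { action: "toolResult", output } : { action: "decline" };
    }
    default:
      // Unknown request types get the most conservative answer.
      return { action: "decline" };
  }
}
```

Because every branch returns immediately, no eval row blocks on human approval.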

Progress Log

2026-04-09

  • Added initial provider implementation at src/providers/openai/codex-app-server.ts.
  • Added provider IDs under the OpenAI registry:
    • openai:codex-app-server
    • openai:codex-app-server:<model>
    • openai:codex-desktop
    • openai:codex-desktop:<model>
  • Implemented stdio JSON-RPC lifecycle:
    • spawn codex app-server --listen stdio://
    • initialize
    • initialized
    • thread/start
    • thread/resume
    • turn/start
    • notification handling through turn/completed
    • thread/unsubscribe and thread/archive cleanup modes
  • Implemented safe config defaults:
    • approval_policy: never
    • sandbox_mode: read-only
    • ephemeral: true
    • thread_cleanup: unsubscribe
    • process env isolation unless inherit_process_env: true
  • Implemented deterministic server request responses:
    • command execution approvals default to decline
    • file changes default to decline
    • permission requests default to empty grants
    • user input requests default to empty answers
    • MCP elicitations default to decline
    • dynamic tools can use static configured responses
  • Implemented output normalization:
    • assistant delta aggregation
    • completed agentMessage fallback/preference
    • token usage from thread/tokenUsage/updated
    • cost estimate for known Codex models
    • metadata with item counts, items, server request decisions, thread/turn ids
  • Implemented provider-level GenAI tracing and item spans.
  • Added mocked protocol tests in test/providers/openai-codex-app-server.test.ts.
  • Added registry tests in test/providers/index.test.ts.
  • Verification so far:
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false
    • npm run tsc -- --pretty false
  • Expanded mocked protocol tests to cover:
    • thread/resume
    • structured prompt input normalization
    • default thread/unsubscribe cleanup
    • user input request policy
    • dynamic tool static response policy
    • metadata sanitization
  • Ran a real local smoke eval through npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache -o /tmp/promptfoo-codex-app-server-example.json.
    • Result: pass.
    • Provider returned Codex app-server sessionId, token usage, item counts, thread id, turn id, and structured JSON output.
  • Ran docs build:
    • cd site && SKIP_OG_GENERATION=true npm run build
    • Result: pass.
  • First dogfood review through examples/openai-codex-app-server/review-diff/promptfooconfig.yaml found four actionable provider issues:
    • startup timeout could leak a spawned app-server and leave a rejected reusable connection promise cached
    • reused connections closed over the first turn's server request policy
    • app-server exit during a turn could leave the eval waiting forever when no turn timeout was configured
    • raw response payload serialized unsanitized protocol items
  • Fixed the dogfood findings and added regression coverage:
    • failed startup closes the process and a later call spawns a fresh process
    • active turns store their effective config so prompt-level server request policies are honored on reused servers
    • connection exit resolves active turns with a provider error and removes the dead connection from reuse maps
    • raw now contains sanitized thread, turn, token usage, notifications, and item metadata
    • final output now uses the last completed agentMessage, which avoids concatenating progress messages with final structured review output
  • Verification after fixes:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 13 provider tests.
  • Second dogfood review passed the Promptfoo eval and returned valid JSON, but still reported two provider comments:
    • legacy execCommandApproval / applyPatchApproval requests identify the active thread with conversationId and expect legacy review decisions
    • persistent thread-pool eviction deleted local handles without unsubscribing the evicted loaded thread
  • Additional hardening from the second dogfood pass:
    • stdio parser now buffers partial JSON-RPC lines and rejoins literal newlines inside command-output strings as escaped newlines before parsing
    • legacy approval requests now map prompt-level policy to approved, approved_for_session, denied, and abort
    • legacy server requests can find active turn state by conversationId
    • evicted persistent cached threads now send thread/unsubscribe before being removed
    • added regression coverage for literal-newline JSON-RPC notifications, legacy approval requests, and persistent thread-pool eviction
  • Verification after second dogfood fixes:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 16 provider tests.
  • Third dogfood review passed transport/eval and reported two lifecycle comments:
    • stale persistent thread handles remained after a reused app-server process exited
    • JSON-RPC request timeout cleanup removed the pending request but left an abort listener attached
  • Additional hardening from the third dogfood pass:
    • connection close now removes cached thread handles owned by that connection key
    • per-request timeout cleanup now removes abort listeners before rejecting
    • added regression coverage for cached-thread invalidation after process exit and abort-listener cleanup on JSON-RPC timeout
  • Verification after third dogfood fixes:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 18 provider tests.
  • Fourth dogfood review passed transport/eval and reported two thread-cache comments:
    • persistent thread caching was still enabled for fresh-per-call app-server processes (reuse_server: false or deep_tracing)
    • pool eviction could unsubscribe an active cached thread before its turn completed
  • Additional hardening from the fourth dogfood pass:
    • thread caching is now allowed only when the app-server connection itself is reusable
    • active/reserved thread ids are protected with a small refcount while a call is using them
    • thread-pool eviction skips protected threads and temporarily allows the pool to exceed its soft cap rather than evicting an in-flight turn
    • added regression coverage for non-reusable persistent-thread configs and active-turn eviction avoidance
  • Verification after fourth dogfood fixes:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 20 provider tests.
  • Fifth dogfood review passed transport/eval and reported one persistent-thread race:
    • concurrent calls with the same persistent-thread cache key could both miss the cache while the first thread/start was still pending, creating duplicate persistent threads and leaking the earlier one
  • Additional hardening from the fifth dogfood pass:
    • added an in-flight thread promise map keyed by thread cache key
    • concurrent same-cache thread/start / thread/resume callers now share the same pending thread handle
    • added regression coverage for concurrent same-cache persistent calls, ensuring only one thread/start is sent and both turns use the shared thread
  • Verification after fifth dogfood fix:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 21 provider tests.
  • Sixth dogfood review passed transport/eval and reported two cache/default comments:
    • reusable connections could keep the first request timeout for requests that did not pass a per-call timeout
    • persistent thread cache keys omitted thread-start options such as ephemeral, experimental_raw_events, and persist_extended_history
  • Additional hardening from the sixth dogfood pass:
    • all provider-owned app-server requests now pass the effective per-call request timeout explicitly
    • persistent thread cache keys now include thread-start options that can change thread semantics
    • added regression coverage for prompt-level request timeouts on reused connections and thread-start option changes in persistent cache keys
  • Verification after sixth dogfood fixes:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 23 provider tests.
  • Seventh dogfood retry:
    • first attempt hit an external Codex connectivity failure (Network is unreachable, Reconnecting... 2/5), which the provider surfaced as a clean provider error
    • retry completed transport/eval and reported one metadata issue: skill-root detection used the parent process env instead of the resolved app-server child env
  • Additional hardening from the seventh dogfood pass:
    • turn state now carries the resolved app-server environment produced by prepareEnvironment
    • skill root detection now uses the child env for CODEX_HOME, HOME, and USERPROFILE
    • added regression coverage for cli_env.HOME skill-call metadata detection
  • Verification after seventh dogfood fix:
    • npm run tsc -- --pretty false
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false
    • Result: pass, 24 provider tests.
  • Eighth dogfood review:
    • npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-review.json
    • Result: pass.
    • Provider output: {"comments":[],"summary":"No actionable findings; TypeScript and focused provider tests passed."}
    • This confirms the provider can be used to review the current git diff and return schema-valid JSON with no remaining actionable comments from the dogfood reviewer.
  • Final verification sweep:
    • npm run f: pass with existing complexity warnings only; no formatting changes needed
    • npm run tsc -- --pretty false: pass
    • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false: pass, 24 provider tests
    • npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false: pass, 2 registry tests
    • npm run l: pass with existing complexity warnings only
    • cd site && SKIP_OG_GENERATION=true npm run build: pass
    • npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-example.json: pass
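The fix from the fifth dogfood pass, an in-flight promise map keyed by thread cache key, follows a common deduplication pattern. The sketch below illustrates that pattern and is not the provider's code: `ThreadStarter` and its members are hypothetical, and a thread is reduced to its id string.

```typescript
class ThreadStarter {
  private readonly cache = new Map<string, string>();
  private readonly inFlight = new Map<string, Promise<string>>();
  private readonly startThread: (cacheKey: string) => Promise<string>;

  constructor(startThread: (cacheKey: string) => Promise<string>) {
    this.startThread = startThread;
  }

  getThread(cacheKey: string): Promise<string> {
    const cached = this.cache.get(cacheKey);
    if (cached !== undefined) return Promise.resolve(cached);

    // A second caller arriving before the first start resolves shares its promise.
    const pending = this.inFlight.get(cacheKey);
    if (pending) return pending;

    const started = this.startThread(cacheKey).then(
      (threadId) => {
        this.inFlight.delete(cacheKey);
        this.cache.set(cacheKey, threadId);
        return threadId;
      },
      (error) => {
        // Clear the entry so a later call can retry instead of reusing a failure.
        this.inFlight.delete(cacheKey);
        throw error;
      },
    );
    this.inFlight.set(cacheKey, started);
    return started;
  }
}
```

Removing the entry on rejection matters as much as on success, since a cached rejected promise would fail every later caller with the same key.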

QA Matrix

Required mocked unit tests:

  • Constructor defaults and strict config validation.
  • Prompt-level unknown config stripping.
  • Missing/inaccessible/non-directory working directory.
  • Git check and skip_git_repo_check.
  • API key/env isolation and explicit cli_env.
  • Handshake order: initialize, initialized, thread/start, turn/start.
  • Model from provider path overrides/defaults correctly.
  • thread_id uses thread/resume.
  • persist_threads reuses cached thread and serializes turns.
  • Non-persistent calls unsubscribe/archive as configured.
  • Assistant deltas aggregate into final output.
  • Completed agentMessage fallback works when deltas are missing.
  • Token usage from thread/tokenUsage/updated.
  • Error notification produces provider error.
  • Failed turn produces provider error.
  • Abort before start.
  • Abort during turn sends turn/interrupt and returns aborted error.
  • Command approval request default decline.
  • File change request default decline.
  • Permission request default empty grant.
  • User input request configured answers.
  • Dynamic tool call configured static response.
  • MCP elicitation default decline/cancel.
  • Metadata contains item counts, trajectories, approvals, raw notifications, and server request decisions without leaking API keys.
  • cleanup kills child process and unregisters provider.
  • deep_tracing injects OTEL env and disables reuse/thread persistence.
  • Provider-level GenAI tracing records response body, token usage, session id, and item count attributes.
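For the metadata check that no API keys leak, a test could run metadata through a redaction pass and assert that known secret values are absent afterwards. The helper below is a generic sketch of such a pass and does not describe the provider's actual sanitizer; `redactSecrets` and the `[REDACTED]` placeholder are hypothetical.

```typescript
// Recursively replaces every occurrence of each known secret value in strings.
function redactSecrets(value: unknown, secrets: string[]): unknown {
  if (typeof value === "string") {
    return secrets.reduce(
      // Skip empty secrets: splitting on "" would break the string into characters.
      (text, secret) => (secret ? text.split(secret).join("[REDACTED]") : text),
      value,
    );
  }
  if (Array.isArray(value)) {
    return value.map((entry) => redactSecrets(entry, secrets));
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>;
    const cleaned: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      cleaned[key] = redactSecrets(source[key], secrets);
    }
    return cleaned;
  }
  return value;
}
```

In a unit test, the serialized result can be searched for the raw key to confirm that nothing slipped through.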

Required docs/examples checks:

  • Provider docs render in Docusaurus.
  • Examples are listed and runnable from repo root with npm run local -- eval ... --no-cache.
  • Config docs call out experimental status, safety defaults, and difference from Codex SDK.

Open Questions

  • Whether to expose WebSocket transport in the first public version. Stdio is enough for Promptfoo-managed app-server processes; WebSocket is useful for external clients but adds auth and lifecycle complexity.
  • Whether to support top-level codex:* aliases immediately or keep all new IDs under openai:* for consistency with the existing Codex SDK provider.
  • Whether to persist generated app-server protocol types in source. The current plan is to implement a narrow local type surface and document how to regenerate schemas instead of committing a large generated bundle.

Critical Audit Follow-up

Review feedback and red-team audit items addressed after the initial dogfood pass:

  • Fixed registry env propagation for object-shaped provider configs. loadApiProvider already merges suite-level and provider-level env into providerOptions.env; the registry now passes that merged env to OpenAICodexAppServerProvider.
  • Fixed service_tier to match the generated app-server schema from codex-cli 0.118.0: fast and flex only.
  • Reset the hoisted spawn mock implementation in beforeEach to satisfy test/AGENTS.md mock isolation rules.
  • Regenerated app-server TypeScript and JSON Schema into /tmp/codex-app-server-schema-current.XVTCwL during the audit and compared rare app-server fields against the implementation.
  • Added coverage for schema-supported rare fields:
    • model_reasoning_effort: none
    • exact personality values: none, friendly, pragmatic
    • granular approval policy objects
    • app-server command approval amendment objects
    • session-scoped permission grants
    • accepted MCP elicitation responses with content and metadata
    • base_instructions, developer_instructions, and collaboration_mode
  • Dogfood review then found additional issues:
    • reusable app-server connections stayed alive after JSON-RPC request timeouts, which could leave late side-effecting responses unmanaged
    • docs listed thread_pool_size as unlimited even though the implementation defaults to 1
    • provider cleanup cleared active turns before resolving them, which could hang shutdown while a turn was in flight
    • raw JSON-RPC notifications were retained even when include_raw_events was false
    • spawned app-server processes could be missed if cleanup ran while initialize was still pending
    • concurrent persistent thread starts could temporarily exceed thread_pool_size and remain over capacity after active turns finished
    • deep-tracing calls that shared a thread_id could overlap turns because the queue key returned early
    • the OpenAI provider docs heading change would have broken the existing #codex-sdk anchor
    • default thread_id resumes skipped unsubscribe cleanup
    • sent JSON-RPC request aborts kept the reusable app-server alive
    • retryable app-server error notifications with willRetry: true were treated as terminal
    • concurrent default-cleanup thread_id rows could unsubscribe while another row was queued for the same thread
  • Fixed these by:
    • closing/evicting connections on timeout and abort
    • resolving active turns during cleanup
    • tracking pending initialization processes
    • making raw event retention opt-in
    • rebalancing the persistent thread pool after turns finish
    • serializing explicit thread_id turns even under deep tracing
    • preserving the OpenAI docs #codex-sdk heading
    • default-unsubscribing non-persistent resumed threads
    • honoring retryable app-server errors
    • deferring resumed-thread unsubscribe until no other protected queued caller remains
  • Updated docs to explain why the app-server provider should stay separate from the Codex SDK provider: the SDK is the right default for CI and automation, while app-server is for rich-client protocol event surfaces and does not attach to a running Codex Desktop app.
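The pool behavior described in these notes (protected threads are never evicted, the pool may temporarily exceed its soft cap, and it is rebalanced after turns finish) can be sketched as below. `rebalancePool` and `CachedThread` are hypothetical names, and evicting in insertion order is an assumption about the eviction policy.

```typescript
interface CachedThread {
  threadId: string;
  refCount: number; // greater than 0 while a call is using or has reserved the thread
}

function rebalancePool(
  pool: Map<string, CachedThread>,
  softCap: number,
  unsubscribe: (threadId: string) => void,
): void {
  // Map preserves insertion order, so the oldest entries are visited first.
  for (const [key, entry] of Array.from(pool)) {
    if (pool.size <= softCap) break;
    // Never evict a thread with an in-flight turn; stay over the cap instead.
    if (entry.refCount > 0) continue;
    unsubscribe(entry.threadId);
    pool.delete(key);
  }
}
```

Calling this after each turn releases its reference lets the pool shrink back to the cap once protected threads become idle.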

Latest focused verification after these fixes:

  • npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false: pass, 35 provider tests.
  • npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share: pass with {"comments":[],"summary":"No actionable issues found in the current diff."}.