# Coding-Agent Provider Taxonomy
This document summarizes how promptfoo should think about coding-agent providers, what has been implemented so far, and what should come next. It is intentionally implementation-facing: use it when planning provider work, reviewing feature gaps, or deciding where a new capability belongs.
This taxonomy covers providers that run an agentic coding runtime, not ordinary single-turn model APIs. A coding-agent provider usually has some combination of persistent session state, sandboxed filesystem/shell/network access, approval or permission prompts, and structured tool or protocol output.
The main providers in this family today are:
| Provider family | Provider IDs | Runtime boundary |
|---|---|---|
| OpenAI Codex SDK | `openai:codex-sdk`, `openai:codex` | `@openai/codex-sdk` library |
| OpenAI Codex app-server | `openai:codex-app-server`, `openai:codex-desktop` | Local `codex app-server` JSON-RPC process |
| Claude Agent SDK | `anthropic:claude-agent-sdk`, `anthropic:claude-code` | `@anthropic-ai/claude-agent-sdk` library |
| OpenCode SDK | `opencode:sdk`, `opencode` | OpenCode SDK plus local or existing server |
Standard OpenAI, Anthropic, Bedrock, Azure, and other model providers still matter for grading and comparison, but they are outside this taxonomy unless they expose a stateful coding-agent runtime.
The first question is where promptfoo stops and the agent runtime starts.
| Boundary | Meaning | Current examples |
|---|---|---|
| In-process SDK | promptfoo calls a package API directly. | Codex SDK, Claude Agent SDK |
| Managed local server | promptfoo starts a server, then talks to it through a client. | OpenCode when `baseUrl` is unset |
| Existing server | promptfoo connects to a runtime it does not configure. | OpenCode with `baseUrl` |
| Local app-server process | promptfoo starts a rich-client protocol server over stdio. | Codex app-server |
| Desktop UI process | Human-facing native app process. | Codex Desktop app, not directly attached |
This distinction matters because it controls what promptfoo can guarantee. If promptfoo starts the runtime, it can set env vars, working directories, sandbox options, tracing, and cleanup behavior. If promptfoo attaches to an existing server, that server owns authentication, installed tools, app connectors, and runtime state.
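The boundary distinction above can be made concrete as a discriminated union. This is a hypothetical sketch; `RuntimeBoundary` and `promptfooOwnsEnvironment` are illustrative names, not promptfoo's actual API:

```typescript
// Sketch: each boundary from the table, as a tagged union so provider code
// can branch on what promptfoo is allowed to control.
type RuntimeBoundary =
  | { kind: 'in-process-sdk'; package: string }
  | { kind: 'managed-local-server'; startCommand: string }
  | { kind: 'existing-server'; baseUrl: string }
  | { kind: 'local-app-server'; transport: 'stdio' };

// If promptfoo starts the runtime, it can own env vars, working directory,
// sandbox options, and cleanup; an existing server owns its own state.
function promptfooOwnsEnvironment(b: RuntimeBoundary): boolean {
  return b.kind !== 'existing-server';
}
```

Branching on `kind` keeps the ownership rule in one place instead of scattering `baseUrl` checks through provider code.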
Coding agents are rarely stateless. Each provider needs explicit semantics for how sessions are created, whether they can be reused across eval rows, and how they are cleaned up.
The important design rule is that session reuse must be opt-in or very clearly scoped. Reusing state silently makes eval results order-dependent.
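One way to enforce that rule is to gate reuse on an explicit flag. A hypothetical sketch, where `reuseSession` is an assumed option name rather than promptfoo's actual config:

```typescript
// Sketch: sessions are fresh per eval row unless config explicitly opts in.
interface SessionConfig {
  reuseSession?: boolean; // assumed option name; must default to false
}

class SessionManager {
  private sessionId: string | undefined;
  private counter = 0;

  resolveSession(config: SessionConfig): string {
    if (config.reuseSession && this.sessionId) {
      return this.sessionId; // explicit opt-in: reuse prior state
    }
    this.sessionId = `session-${++this.counter}`; // default: fresh session
    return this.sessionId;
  }
}
```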
Agent evals should separate filesystem access, network access, and shell access. Those are different risks.
| Surface | Safe default | Higher-risk mode |
|---|---|---|
| Filesystem | Temporary directory or read-only workspace | Workspace write or full filesystem access |
| Shell | Disabled or approval-gated | Allowed command execution |
| Network | Disabled unless explicitly requested | Host allow-lists, live web/search, package installs |
| App/plugin | Not installed or not invoked by default | App connectors, plugin installs, config writes |
| Environment | Minimal env | Inherited process env with secrets and local auth state |
Provider docs should make clear that `danger-full-access` is not the same as network access, and read-only filesystem mode does not automatically sanitize env vars. Each surface should have its own option and its own tests.
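A hedged sketch of what independent per-surface options could look like; the option names and values below are illustrative, not promptfoo's actual config schema:

```typescript
// Sketch: one option per risk surface, so enabling full filesystem access
// never implies network or shell access.
interface SandboxOptions {
  filesystem: 'temp' | 'read-only' | 'workspace-write' | 'full';
  shell: 'disabled' | 'approval-gated' | 'allowed';
  network: 'disabled' | 'allow-list' | 'full';
  inheritEnv: boolean;
}

const safeDefaults: SandboxOptions = {
  filesystem: 'temp',
  shell: 'disabled',
  network: 'disabled',
  inheritEnv: false,
};

// Full filesystem access still leaves network and shell at their safe defaults.
const fullFsStillOffline: SandboxOptions = { ...safeDefaults, filesystem: 'full' };
```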
Promptfoo evals are non-interactive by default. Agent runtimes often expect a human to answer approval prompts, permission requests, or clarification questions. Providers should convert those into deterministic policies.
Common policy categories map onto the interactions above: approval prompts, permission requests, and clarification questions.
Default policy should decline, cancel, or return empty answers unless a config opts into side effects. Every accepted side effect should be visible in metadata.
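Converting interactive prompts into a deterministic policy might look like the following sketch; `ApprovalPolicy` and `decide` are illustrative names, and the decline-by-default behavior mirrors the rule above:

```typescript
// Sketch: deterministic decisions for prompts that normally need a human.
type ApprovalDecision = 'approve' | 'decline' | 'cancel';

interface ApprovalPolicy {
  commandExecution?: ApprovalDecision;
  fileWrite?: ApprovalDecision;
  default: ApprovalDecision;
}

function decide(
  policy: ApprovalPolicy,
  kind: 'commandExecution' | 'fileWrite',
): ApprovalDecision {
  // Decline unless the config explicitly opts into the side effect.
  return policy[kind] ?? policy.default;
}

const defaultPolicy: ApprovalPolicy = { default: 'decline' };
```

Every accepted side effect would then be recorded in metadata alongside the decision that allowed it.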
The baseline input is a prompt string, but coding-agent providers increasingly need structured inputs as well.
Provider-specific JSON input arrays are acceptable when the underlying runtime has typed input items. Unknown JSON shapes should usually degrade to plain text rather than crashing an eval row, unless the provider docs promise strict input parsing.
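The degrade-to-text rule can be sketched as a parser; this is illustrative, not promptfoo's actual input handling:

```typescript
// Sketch: try to parse the prompt as a typed input array; fall back to a
// single plain-text item on unknown shapes instead of failing the eval row.
type InputItem = { type: 'text'; text: string };

function parsePromptInput(prompt: string): InputItem[] {
  try {
    const parsed = JSON.parse(prompt);
    if (
      Array.isArray(parsed) &&
      parsed.every((i: any) => i && i.type === 'text' && typeof i.text === 'string')
    ) {
      return parsed as InputItem[];
    }
  } catch {
    // Not JSON: fall through to plain text.
  }
  return [{ type: 'text', text: prompt }];
}
```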
All coding-agent providers should return a normal promptfoo provider result:
- `output`: final assistant-facing text.
- `sessionId`: session/thread id when available.
- `tokenUsage`: runtime usage when available.
- `cost`: estimate when usage and model pricing are known.
- `metadata`: normalized agent metadata.
- `raw`: raw or summarized protocol data when useful and safe.

Provider-specific metadata is still valuable, but consumers need a shared shape for cross-provider assertions and dashboards. A future shared schema should include the fields proposed under the `metadata.agent` roadmap item below.
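The result shape can be written down as a TypeScript interface; field names follow the document, while the concrete value types are assumptions:

```typescript
// Sketch of the normalized provider result; types are assumptions.
interface TokenUsage {
  prompt?: number;
  completion?: number;
  total?: number;
}

interface CodingAgentResult {
  output: string;                     // final assistant-facing text
  sessionId?: string;                 // session/thread id when available
  tokenUsage?: TokenUsage;            // runtime usage when available
  cost?: number;                      // estimate when usage and pricing are known
  metadata?: Record<string, unknown>; // normalized agent metadata
  raw?: unknown;                      // raw or summarized protocol data
}

const example: CodingAgentResult = { output: 'done', sessionId: 'abc' };
```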
Tracing should answer two questions:
Provider tracing should include the top-level callApi span, item/tool-level spans where possible, and sanitized attributes for prompts, commands, tool inputs, file paths, and outputs. Deep tracing should be opt-in when it requires injecting OpenTelemetry env vars into a child process.
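A minimal sketch of attribute sanitization before export; the secret-key patterns are illustrative, not an exhaustive or actual implementation:

```typescript
// Sketch: redact attribute values whose keys look secret-bearing before
// attaching them to a span.
function sanitizeAttributes(attrs: Record<string, string>): Record<string, string> {
  const secretKey = /(token|secret|api[_-]?key|password)/i;
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = secretKey.test(key) ? '[REDACTED]' : value;
  }
  return out;
}
```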
The coding-agent providers already share several practical patterns.
Useful files:
- `src/providers/agentic-utils.ts`
- `src/providers/claude-agent-sdk.ts`
- `src/providers/opencode-sdk.ts`
- `src/providers/openai/codex-sdk.ts`
- `src/providers/openai/codex-app-server.ts`
- `src/providers/registry.ts`

Status: implemented and documented.
## OpenAI Codex SDK

Provider IDs:

- `openai:codex-sdk`
- `openai:codex-sdk:<model>`
- `openai:codex`
- `openai:codex:<model>`

Implemented capabilities:
- Built on the `@openai/codex-sdk` library.
- `SKILL.md` reads.

Important limits:
Docs and examples:
- `site/docs/providers/openai-codex-sdk.md`
- `examples/openai-codex-sdk/`

Status: implemented, documented, and validated with mocked protocol tests plus a real local eval.
## OpenAI Codex app-server

Provider IDs:

- `openai:codex-app-server`
- `openai:codex-app-server:<model>`
- `openai:codex-desktop`
- `openai:codex-desktop:<model>`

Implemented capabilities:
- Launches `codex app-server --listen stdio://`.
- `server_request_policy`.

Important limits:
Docs and examples:
- `site/docs/providers/openai-codex-app-server.md`
- `docs/agents/codex-app-server-provider-notes.md`
- `examples/openai-codex-app-server/`

Status: implemented and documented.
## Claude Agent SDK

Provider IDs:

- `anthropic:claude-agent-sdk`
- `anthropic:claude-code`

Implemented capabilities:
- Built on the `@anthropic-ai/claude-agent-sdk` library.

Important limits:
Docs and examples:
- `site/docs/providers/claude-agent-sdk.md`
- `examples/claude-agent-sdk/`

Status: implemented and documented.
## OpenCode SDK

Provider IDs:

- `opencode:sdk`
- `opencode`

Implemented capabilities:
- Built on `@opencode-ai/sdk` v1 or v2.
- Starts a managed local server when `baseUrl` is unset.
- Connects to an existing server when `baseUrl` is provided.
- `format`.

Important limits:
- When `baseUrl` is set, the existing server owns auth, MCP setup, installed agents, and server-side configuration.

Docs and examples:
- `site/docs/providers/opencode-sdk.md`
- `examples/provider-opencode-sdk/`

## Provider ID naming

Use provider IDs that encode the runtime boundary, not just the model vendor.
- `openai:codex` should continue to mean the Codex SDK alias because that is the best default for automation.
- `openai:codex-app-server` should mean the app-server JSON-RPC protocol.
- `openai:codex-desktop` should remain an alias for app-server behavior unless or until promptfoo can actually attach to the Desktop app process.
- `anthropic:claude-code` should remain an alias for Claude Agent SDK because the SDK is still built on Claude Code.
- `opencode` can remain a convenience alias for `opencode:sdk`.

Avoid adding unscoped top-level aliases such as `codex:desktop` until the OpenAI scoped names are stable and the docs can clearly explain the difference between SDK, app-server, and Desktop UI attachment.
Create a cross-provider `metadata.agent` shape while preserving provider-specific metadata namespaces such as `metadata.codexAppServer`.
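For illustration, such a shape could be captured in a TypeScript interface; the field names come from this plan, while the value types are assumptions:

```typescript
// Sketch of a cross-provider metadata.agent shape; value types are assumed.
interface AgentMetadata {
  runtime: 'codex-sdk' | 'codex-app-server' | 'claude-agent-sdk' | 'opencode';
  runtimeVersion?: string;
  sessionId?: string;
  threadId?: string;
  turnId?: string;
  workingDir?: string;
  sandbox?: string;
  network?: string;
  approvalPolicy?: string;
  tools?: Array<{ name: string; outcome: 'success' | 'error' | 'declined' }>;
  commands?: Array<{ command: string; exitCode?: number }>;
  fileChanges?: Array<{ path: string; kind: 'write' | 'edit' | 'delete' }>;
  approvals?: Array<{ prompt: string; decision: string }>;
  mcp?: Array<{ server: string; method: string }>;
  skills?: Array<{ name: string; confirmed: boolean }>;
  trace?: { traceId?: string; spanIds?: string[] };
}

const sample: AgentMetadata = {
  runtime: 'codex-app-server',
  sessionId: 'sess-1',
  approvals: [{ prompt: 'run shell command?', decision: 'decline' }],
};
```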
Proposed fields:
- `runtime`: `codex-sdk`, `codex-app-server`, `claude-agent-sdk`, or `opencode`.
- `runtimeVersion`: runtime-reported version when available.
- `sessionId`, `threadId`, `turnId`.
- `workingDir`, `sandbox`, `network`, `approvalPolicy`.
- `tools`: normalized tool calls and outcomes.
- `commands`: normalized shell command executions.
- `fileChanges`: normalized file write/edit/delete attempts.
- `approvals`: normalized approval prompts and decisions.
- `mcp`: normalized MCP calls and elicitations.
- `skills`: confirmed and attempted skill usage.
- `trace`: trace ids and span ids when available.

Acceptance criteria:
Add a reusable test contract for coding-agent providers. Each provider can implement the same scenarios with its own mocked runtime.
Core scenarios:
Acceptance criteria:
- `beforeEach`.

Add a docs page or generated table that compares coding-agent provider capabilities.
Suggested columns:
Acceptance criteria:
Create a small set of real eval examples that can be run selectively by maintainers.
Minimum scenarios:
Acceptance criteria:
- `--no-cache`.

Add adversarial tests for the risky parts of agent runtimes.
High-value cases:
- `SKILL.md` or plugin instructions.

Acceptance criteria:
The app-server provider now covers the core eval path. The next app-server-specific work should focus on rare protocol features and product integration boundaries.
Candidates:
- `model/list`, `skills/list`, `plugin/list`, `plugin/read`, and `app/list` metadata discovery.
- `review/start` support for native review flows.
- `turn/steer` and `turn/interrupt` tests for cancellation and mid-turn control.

Acceptance criteria:
Build a reusable harness for tests and examples that need write access.
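A minimal sketch of such a harness using only Node's standard library; `withWorkspace` is an illustrative name, not an existing promptfoo helper:

```typescript
// Sketch: each run gets a fresh temporary workspace that is removed even
// when the callback throws, so state never leaks between rows.
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

function withWorkspace<T>(fn: (dir: string) => T): T {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'agent-eval-'));
  try {
    return fn(dir);
  } finally {
    // Cleanup runs on success and failure alike.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

An async variant would be needed for real agent runs; the cleanup-in-`finally` shape stays the same.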
It should provide:
Acceptance criteria:
Improve the information architecture around coding-agent providers.
Recommended docs:
Acceptance criteria:
Before merging a provider in this family, verify:
- `beforeEach`.

Open questions:

- Should we ship the `metadata.agent` schema now, or keep it internal until at least two providers use it in docs examples?
- Should `codex:app-server` wait for a broader provider naming cleanup?