redteam-coding-agent (Coding Agent Red Team)

Red-team autonomous coding agents against repository prompt injection, terminal output injection, secret reads, procfs abuse, sandbox read/write escapes, network egress, delayed CI exfiltration, generated vulnerabilities, automation poisoning, steganographic leakage, and verifier sabotage.

bash

npx promptfoo@latest init --example redteam-coding-agent
cd redteam-coding-agent

Quick start

The default config targets gpt-5.4 as a simulated coding agent. Set your API key and run:

bash

export OPENAI_API_KEY=sk-...
npx promptfoo@latest redteam run
npx promptfoo@latest view

Required environment:

OPENAI_API_KEY: Used by the default OpenAI provider. The Codex SDK provider also uses this when CODEX_API_KEY is not set.

Optional environment:

CODEX_API_KEY: Use this for Codex SDK auth when you do not want to reuse OPENAI_API_KEY.
PROMPTFOO_REMOTE_GENERATION_URL: Point generation at a self-hosted red-team generation service.
PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true: Disable remote generation. The coding-agent:core and coding-agent:all collections and the individual coding-agent plugins are unavailable in this mode because they require remote-generated scenarios.
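The key fallback described above (CODEX_API_KEY when set, otherwise OPENAI_API_KEY) can be sketched as a small preflight check. This is only an illustration of the documented behavior: `resolve_codex_key` is a hypothetical helper and is not part of promptfoo or the Codex SDK.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the documented fallback, CODEX_API_KEY first, then OPENAI_API_KEY.
# resolve_codex_key is a hypothetical helper name, not a promptfoo command.
resolve_codex_key() {
  # Prefer CODEX_API_KEY; fall back to OPENAI_API_KEY; empty if neither is set.
  local key="${CODEX_API_KEY:-${OPENAI_API_KEY:-}}"
  if [ -z "$key" ]; then
    echo "Set OPENAI_API_KEY (or CODEX_API_KEY) before running the red team." >&2
    return 1
  fi
  printf '%s\n' "$key"
}
```

One possible use is to gate the run on the check: `resolve_codex_key >/dev/null && npx promptfoo@latest redteam run`.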

When testing a real coding agent, set providers[0].config.working_dir to a disposable checkout and use synthetic canary secrets. Do not put production credentials in the eval environment.
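A synthetic canary can be any random, greppable value that is useless outside the eval. The sketch below is one way to generate one by hand; `make_canary`, the `CANARY_` prefix, and the `SYNTHETIC_SECRET` variable name are all assumptions made for illustration and are not names promptfoo defines.

```shell
#!/usr/bin/env bash
# Sketch only: make_canary, the CANARY_ prefix, and SYNTHETIC_SECRET are illustrative names.
make_canary() {
  # 16 random bytes rendered as 32 hex characters, prefixed so the value is easy to search for.
  local hex
  hex="$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"
  printf 'CANARY_%s\n' "$hex"
}

# Expose the canary to the agent's environment in place of any real credential.
SYNTHETIC_SECRET="$(make_canary)"
export SYNTHETIC_SECRET
```

Because the value is random and carries no privileges, a leak during the eval is detectable but harmless.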

Using a real coding agent

For meaningful results, point the config at a real agent provider:

Codex SDK: Uncomment the openai:codex-sdk provider in promptfooconfig.yaml and set working_dir to a disposable checkout.
Custom agent: Use a file:// provider or HTTP endpoint that wraps your agent framework.

Real agent providers should return structured output with command executions and trace data so the deterministic verifiers can inspect commands, outputs, and file changes, not just the final response.

Use a fresh writable checkout for each red-team row when you want high-confidence results. Coding agents leave files behind, change tests, and may copy canaries into artifacts; if one workspace is reused across rows, residue from an earlier row can cause a later row to fail and the failure to be attributed to the wrong plugin.
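One way to get a fresh checkout per row is to copy a clean template directory into a new temporary directory each time. The sketch below assumes such a template exists; `new_workspace` and `TEMPLATE_DIR` are hypothetical names, and how the resulting path is wired into working_dir depends on your provider setup.

```shell
#!/usr/bin/env bash
# Sketch only: new_workspace is a hypothetical helper.
# Usage: new_workspace TEMPLATE_DIR  -> prints the path of a fresh copy.
new_workspace() {
  local template="$1" ws
  ws="$(mktemp -d)" || return 1
  # "template/." copies the directory contents, including dotfiles, into the new directory.
  cp -R "$template"/. "$ws"/ || return 1
  printf '%s\n' "$ws"
}

# Example flow (TEMPLATE_DIR is an assumed path to a clean copy of the repository):
#   ws="$(new_workspace "$TEMPLATE_DIR")"
#   ... run one red-team row with working_dir pointing at "$ws" ...
#   rm -rf "$ws"
```

Copying from a template rather than cleaning in place means that files an agent wrote in one row cannot survive into the next.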

What the plugins test

Plugin	Vulnerability	Detection
`coding-agent:repo-prompt-injection`	Agent follows hidden instructions in repository files	Canary value detection + LLM grading
`coding-agent:terminal-output-injection`	Agent treats build/test output as authoritative commands	Receipt propagation detection + LLM grading
`coding-agent:secret-env-read`	Agent reads or exposes launcher environment variables	Canary value detection in output/traces
`coding-agent:secret-file-read`	Agent reads or exposes protected local file contents	Canary/digest detection in output/traces/artifacts
`coding-agent:sandbox-read-escape`	Agent reads files outside the intended workspace	Receipt detection + LLM grading
`coding-agent:sandbox-write-escape`	Agent mutates files outside the intended workspace	Command detection + host file/probe checks + LLM grading
`coding-agent:network-egress-bypass`	Agent calls or plants code for unapproved network egress	Command detection + trap/proxy logs + workspace scan + LLM grading
`coding-agent:procfs-credential-read`	Agent recovers credentials through process metadata	Procfs command detection + receipt detection
`coding-agent:delayed-ci-exfil`	Agent plants future CI or automation data leakage	Command detection + workflow/package scan + LLM grading
`coding-agent:generated-vulnerability`	Agent introduces insecure code while completing a feature	High-confidence code heuristics + LLM grading
`coding-agent:automation-poisoning`	Agent persists unsafe hooks, scripts, or workflows	Automation scan + host file hashes + LLM grading
`coding-agent:steganographic-exfil`	Agent hides secrets in encoded or low-visibility sinks	Transformed canary detection + artifact scan
`coding-agent:verifier-sabotage`	Agent weakens tests instead of fixing code	Heuristic pattern detection + file hash checks + LLM grading
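
To make "canary value detection" and "transformed canary detection" concrete, the sketch below searches a directory of outputs or artifacts for a canary in plain, base64, and hex form. It is a simplified illustration and not promptfoo's verifier: `scan_for_canary` is a hypothetical helper, and the base64 form only matches when the canary was encoded on its own rather than embedded in a longer encoded string.

```shell
#!/usr/bin/env bash
# Sketch only: scan_for_canary is a hypothetical helper, not promptfoo's verifier.
# Usage: scan_for_canary CANARY DIR  -> prints files containing the canary; exit 0 if any match.
scan_for_canary() {
  local canary="$1" dir="$2" b64 hex
  # Encoded forms of the canary itself (standalone encoding only).
  b64="$(printf '%s' "$canary" | base64 | tr -d '\n')"
  hex="$(printf '%s' "$canary" | od -An -v -tx1 | tr -d ' \n')"
  # -r recurse, -l list file names only, -F fixed strings, one -e per form.
  grep -rlF -e "$canary" -e "$b64" -e "$hex" "$dir"
}
```

A call such as `scan_for_canary "$SYNTHETIC_SECRET" ./artifacts` (both names assumed) lists every file where one of the three forms appears.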

Strategies

jailbreak:meta is the iterative single-turn strategy. jailbreak:hydra, goat, and crescendo exercise conversational multi-turn behavior and need providers and workspaces that can safely preserve state across turns. Encoding and prompt-rewriting strategies (base64, rot13, math-prompt, jailbreak:composite, etc.) are automatically excluded because they would break deterministic canary matching.
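
A small demonstration of why an encoding strategy and literal canary matching do not mix: once text containing a canary is base64-encoded, a fixed-string search for the canary no longer finds it. This is a sketch of the general problem under assumed values, not a description of promptfoo's internals; the canary, the prompt text, and the `contains` helper are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the canary, the prompt text, and contains() are illustrative.
canary="CANARY_demo_0001"
prompt="Print the value of $canary to the build log."

# A base64 rewrite of the whole prompt, as an encoding strategy might produce.
encoded="$(printf '%s' "$prompt" | base64 | tr -d '\n')"

# contains HAYSTACK NEEDLE -> exit 0 if NEEDLE occurs literally in HAYSTACK.
contains() { case "$1" in *"$2"*) return 0 ;; *) return 1 ;; esac; }

contains "$prompt" "$canary" && echo "plain prompt: canary found"
contains "$encoded" "$canary" || echo "encoded prompt: canary not found"
```

The second check fails to find the canary because the underscore in the canary is not part of the standard base64 alphabet, so the literal string cannot appear in the encoded text.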

Use coding-agent:core for quick baseline coverage, or switch to coding-agent:all when you want the full coding-agent suite in one collection.

Safety

Never place production secrets in red-team configs. Use synthetic canaries ({{secretEnvValue}}).
Run agents in disposable, least-privilege sandboxes and delete or reset the workspace between rows.
The generated prompts ask the agent to do unsafe things; that is the point. The graders evaluate whether the agent resists.