site/docs/integrations/agent-skill.md
AI coding agents can write promptfoo configs, but they often get the details wrong — shell-style env vars that don't work, hallucination rubrics that can't see the source material, tests dumped inline instead of in files. The promptfoo-evals skill fixes this by teaching your agent promptfoo's conventions and common pitfalls.
It works with Claude Code and OpenAI Codex. Because it follows the open Agent Skills standard, it should also work with other compatible tools.
Without the skill, agents frequently:

- Use shell-style `$ENV_VAR` syntax in YAML configs (doesn't work — promptfoo uses Nunjucks `{{env.VAR}}`)
- Write `llm-rubric` assertions that reference "the article" but don't inline the source, so the grader can't actually compare
- Dump every test case inline instead of splitting them into files loaded via `file://tests/*.yaml`
- Reach for `llm-rubric` when `contains` or `is-json` would be faster, free, and deterministic

The skill encodes these patterns so the agent gets them right the first time.
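To make the first pitfall concrete, here is a minimal sketch of the correct environment-variable reference in a config (the provider id and key name are illustrative):

```yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      # Correct: Nunjucks template resolved by promptfoo at runtime
      apiKey: '{{env.OPENAI_API_KEY}}'
      # Wrong: shell-style interpolation is passed through literally
      # apiKey: $OPENAI_API_KEY
```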
In Claude Code, install the skill from the plugin marketplace:

/plugin marketplace add promptfoo/promptfoo
/plugin install promptfoo-evals@promptfoo
For repo-local Codex usage, this repo also includes a plugin bundle at `plugins/promptfoo`, exposed by `.agents/plugins/marketplace.json`. It contains four focused skills: `promptfoo-evals`, `promptfoo-provider-setup`, `promptfoo-redteam-setup`, and `promptfoo-redteam-run`.
| Skill | Use it for |
|---|---|
| `promptfoo-evals` | Non-redteam eval suites, assertions, test cases, and result inspection |
| `promptfoo-provider-setup` | HTTP targets plus JavaScript or Python `file://` providers and wrappers |
| `promptfoo-redteam-setup` | Focused redteam configs from live endpoints, OpenAPI specs, or static code |
| `promptfoo-redteam-run` | Running generated scans, triaging failures, and filtered reruns |
There is intentionally no meta selector skill. Codex routes from each skill's description and default prompt, keeping the bundle small and each workflow directly invokable.
Python providers are first-class in the Codex bundle. The provider and redteam skills cover promptfoo's `file://provider.py` and `file://provider.py:function_name` syntax for eval providers, redteam targets, local graders, and local redteam generators, including workers, timeout, and `PROMPTFOO_PYTHON` configuration.
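As a rough sketch, a provider list entry can point at a Python file directly (promptfoo then calls its default `call_api` entry point) or name a specific function in that file; the file and function names below are hypothetical:

```yaml
providers:
  # Uses the file's default call_api entry point
  - file://providers/support_bot.py
  # Targets a named function inside the same file
  - file://providers/support_bot.py:generate_reply
```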
Use the Claude marketplace command above when you want the portable single
promptfoo-evals skill. Use the Codex bundle when working in this repo and you
want separate eval, provider setup, redteam setup, and redteam run workflows.
To reuse the bundle elsewhere, copy plugins/promptfoo and its
.agents/plugins/marketplace.json entry together.
To install manually instead, download the skill directory and copy it to the correct location for your tool:
Claude Code (project-level — recommended for teams):
cp -r promptfoo-evals your-project/.claude/skills/
Claude Code (personal — available in all projects):
cp -r promptfoo-evals ~/.claude/skills/
OpenAI Codex / other Agent Skills tools:
cp -r promptfoo-evals your-project/.agents/skills/
:::note
For team adoption, commit the skill to your repo's skill directory (.claude/skills/ for Claude Code, .agents/skills/ for Codex). Every developer's agent picks it up automatically — no per-person install needed.
:::
The core skill consists of two files:
| File | Purpose |
|---|---|
| `SKILL.md` | Workflow instructions the agent follows |
| `references/cheatsheet.md` | Assertion types, provider patterns, and config examples |
Once installed, the agent activates automatically when you ask it to create or update eval coverage. In Claude Code, you can also invoke it directly with a slash command:
/promptfoo-evals Create an eval suite for my summarization prompt
In Codex and other Agent Skills tools, simply ask the agent to create an eval — the skill activates based on the task context.
The agent will:
- Scaffold the standard project layout (`promptfooconfig.yaml`, `prompts/`, `tests/`)
- Check the config with `promptfoo validate`

:::note
New to promptfoo? See Getting Started for an overview of configs, providers, and assertions.
:::
Key patterns the skill encodes:

- Prefer deterministic assertions (`contains`, `is-json`, `javascript`) before reaching for `llm-rubric`. Deterministic checks are fast, free, and reproducible.
- Keep test cases in separate `tests/*.yaml` files loaded via a `file://tests/*.yaml` glob, keeping configs clean as test count grows (a sketch of such a file follows the example config below).
- Load external test data with `tests: file://tests.csv` or script-generated tests like `file://generate_tests.py:create_tests`.
- When using `llm-rubric` to check for hallucination, the source material must be inlined in the rubric via `{{variable}}` so the grader can actually compare.
- Pin a grading provider (`defaultTest.options.provider` or `assertion.provider`) for stable scoring.
- Reference environment variables as `'{{env.API_KEY}}'` in YAML configs, not shell syntax.
- Run `promptfoo eval -o output.json --no-cache` and inspect `success`, `score`, and `error`.

Ask the agent to "create an eval for a customer support chatbot that returns JSON" and it produces:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: 'Customer support chatbot'

prompts:
  - file://prompts/chat.json

providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      temperature: 0
      response_format:
        type: json_object

defaultTest:
  assert:
    - type: is-json
    - type: cost
      threshold: 0.01

tests:
  - file://tests/*.yaml
  - description: 'Returns order status for valid customer'
    vars:
      order_id: 'ORD-1001'
      customer_name: 'Alice Smith'
    assert:
      - type: is-json
        value:
          type: object
          required: [status, message]
      - type: javascript
        value: "JSON.parse(output).status === 'shipped'"
The skill is just markdown files — edit them to match your team's conventions: