openai-codex-sdk/skill-comparison (Codex Skill Comparison)

You can run this example with:

bash

npx promptfoo@latest init --example openai-codex-sdk/skill-comparison
cd openai-codex-sdk/skill-comparison

Overview

This example compares two versions of the same Codex skill against identical review tasks.

fixtures/v1 contains a narrower review-standards skill that only calls out weak password hashing.
fixtures/v2 contains a stronger version that also checks timing-safe secret comparison.
Both providers share an output_schema (declared once via a YAML anchor) so each response is guaranteed to match the review JSON shape.
The eval verifies skill-used, scores issue recall and precision, and uses max-score to select the best output for each test case.

Run it from this directory with:

bash

promptfoo eval --no-cache

If you run it from another directory, set these environment variables first:

bash

export CODEX_SKILL_COMPARE_V1_DIR=/absolute/path/to/fixtures/v1
export CODEX_SKILL_COMPARE_V2_DIR=/absolute/path/to/fixtures/v2

The checked-in sample-codex-home directory is intentionally empty of auth state. Use an API key, or set CODEX_HOME_OVERRIDE="$HOME/.codex" to reuse a local Codex login.

Because this example uses max-score, the weaker candidate is expected to fail when Promptfoo marks the stronger output as the winner for a test case.