openai-codex-sdk/skill-comparison (Codex Skill Comparison)

examples/openai-codex-sdk/skill-comparison/README.md

You can run this example with:

```bash
npx promptfoo@latest init --example openai-codex-sdk/skill-comparison
cd openai-codex-sdk/skill-comparison
```

Overview

This example compares two versions of the same Codex skill against identical review tasks.

  • fixtures/v1 contains a narrower review-standards skill that only calls out weak password hashing.
  • fixtures/v2 contains a stronger version that also checks timing-safe secret comparison.
  • Both providers share an output_schema (declared once via a YAML anchor) so each response is guaranteed to match the review JSON shape.
  • The eval checks that the skill was actually used (skill-used), scores issue recall and precision, and uses max-score to select the best output for each test case.
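To make the recall and precision scoring concrete, here is a minimal sketch using the usual set-based definitions. The `score_review` function and the issue labels are hypothetical and are not taken from this example's config, whose actual scoring logic may differ.

```bash
# Hypothetical helper: compare expected issue labels with reported ones.
# Both arguments are space-separated lists of unique labels.
# Prints "recall precision", each rounded to two decimals.
score_review() {
  local expected="$1" reported="$2"
  local e r hits=0 n_expected=0 n_reported=0
  # Intentionally unquoted so the shell splits the lists on whitespace.
  for e in $expected; do
    n_expected=$((n_expected + 1))
    for r in $reported; do
      if [ "$e" = "$r" ]; then
        hits=$((hits + 1))
      fi
    done
  done
  for r in $reported; do
    n_reported=$((n_reported + 1))
  done
  # recall = hits / expected, precision = hits / reported (0 if the list is empty)
  awk -v h="$hits" -v e="$n_expected" -v r="$n_reported" \
    'BEGIN { printf "%.2f %.2f\n", (e ? h / e : 0), (r ? h / r : 0) }'
}
```

For instance, `score_review "weak-password-hashing timing-unsafe-comparison" "weak-password-hashing"` prints `0.50 1.00`: a review that only flags weak hashing is precise but misses half of the expected issues, which is the kind of gap the v2 skill is meant to close.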

Run it from this directory with:

```bash
promptfoo eval --no-cache
```

If you run it from another directory, set these environment variables first:

```bash
export CODEX_SKILL_COMPARE_V1_DIR=/absolute/path/to/fixtures/v1
export CODEX_SKILL_COMPARE_V2_DIR=/absolute/path/to/fixtures/v2
```
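If you prefer not to type both paths by hand, the two variables can be derived from a single example root. This is a sketch only: `set_skill_compare_dirs` is a hypothetical helper, and it assumes the fixtures live under `fixtures/v1` and `fixtures/v2` inside the directory you pass in.

```bash
# Hypothetical helper: export both fixture variables from one example root.
set_skill_compare_dirs() {
  local root="$1"   # path to the skill-comparison example directory
  local version dir
  # Fail early if either fixture directory is missing.
  for version in v1 v2; do
    dir="$root/fixtures/$version"
    if [ ! -d "$dir" ]; then
      echo "missing fixture directory: $dir" >&2
      return 1
    fi
  done
  # Resolve to absolute paths, since the variables expect absolute paths.
  CODEX_SKILL_COMPARE_V1_DIR="$(cd "$root/fixtures/v1" && pwd)"
  CODEX_SKILL_COMPARE_V2_DIR="$(cd "$root/fixtures/v2" && pwd)"
  export CODEX_SKILL_COMPARE_V1_DIR CODEX_SKILL_COMPARE_V2_DIR
}
```

After defining the function, call it with the example directory (for example `set_skill_compare_dirs path/to/skill-comparison`) before running `promptfoo eval --no-cache`. It returns a non-zero status without exporting anything if a fixture directory is missing.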

The checked-in sample-codex-home directory intentionally contains no auth state. To authenticate, either provide an API key or set CODEX_HOME_OVERRIDE="$HOME/.codex" to reuse a local Codex login.
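The choice between the two auth options can be scripted. The sketch below assumes the API key is supplied through `OPENAI_API_KEY`, which this README does not name, so check the provider documentation for the variable your setup expects. `choose_codex_auth` is a hypothetical helper.

```bash
# Hypothetical helper: prefer an API key, otherwise reuse a local Codex login.
choose_codex_auth() {
  # Assumption: the key is read from OPENAI_API_KEY.
  if [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "using API key from OPENAI_API_KEY"
  elif [ -d "$HOME/.codex" ]; then
    # Point the example at the existing local Codex state.
    export CODEX_HOME_OVERRIDE="$HOME/.codex"
    echo "reusing local Codex login at $CODEX_HOME_OVERRIDE"
  else
    echo "no API key set and no $HOME/.codex directory found" >&2
    return 1
  fi
}
```

Calling `choose_codex_auth` before `promptfoo eval --no-cache` reports which option is in effect, or fails with a message when neither is available.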

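Because this example uses max-score, only the highest-scoring output passes each test case. The weaker candidate is therefore expected to fail whenever Promptfoo marks the stronger output as the winner; these failures are the comparison result, not a sign of a broken run.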