examples/eval-f-score/README.md
You can run this example with:

```sh
npx promptfoo@latest init --example eval-f-score
cd eval-f-score
```
This project uses promptfoo to evaluate GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis. Each model response is expected to include a `sentiment` field set to `"positive"` or `"negative"`.
Set your OpenAI API key and run the evaluation:

```sh
promptfoo eval
```
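For orientation, a minimal `promptfooconfig.yaml` along these lines might look like the sketch below. The prompt wording and CSV file name are illustrative assumptions, not the example's actual config; `openai:gpt-4o-mini` is promptfoo's standard provider id for this model.

```yaml
# Sketch only — the real example's prompt and test file may differ.
providers:
  - openai:gpt-4o-mini

prompts:
  - >-
    Classify the sentiment of the following movie review as positive or
    negative. Respond with JSON: {"sentiment": "positive" | "negative"}.

    Review: {{text}}

# Each CSV row becomes a test case with `text` and `sentiment` variables.
tests: file://imdb_sample.csv
```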
The evaluation uses the IMDB dataset from HuggingFace's `datasets` library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

- `text`: The movie review content
- `sentiment`: The label (`"positive"` or `"negative"`)

To modify the sample size or generate a new dataset, you can use `prepare_data.py`. First, install the Python dependencies:
```sh
pip install -r requirements.txt
```

Then run the preparation script:

```sh
python prepare_data.py
```
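The core of that preprocessing step can be sketched as follows. This is a minimal stand-in, not the actual `prepare_data.py`: the real script would load and sample the IMDB dataset (e.g. via `datasets.load_dataset("imdb")`), whereas here a tiny in-memory list of records takes its place.

```python
import csv

# Stand-in for rows sampled from the HuggingFace IMDB dataset.
# IMDB encodes labels as integers: 0 = negative, 1 = positive.
records = [
    {"text": "A moving, beautifully shot film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]

LABELS = {0: "negative", 1: "positive"}

# Write the two-column CSV the evaluation consumes.
with open("imdb_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
    writer.writeheader()
    for row in records:
        writer.writerow({"text": row["text"], "sentiment": LABELS[row["label"]]})
```

The integer-to-string mapping matters because the assertions compare against the literal strings `'positive'` and `'negative'`, not the dataset's raw labels.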
The evaluation implements F-score and related metrics using promptfoo's assertion system:

```yaml
assert:
  - type: javascript
    value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
    metric: true_positives

derivedMetrics:
  - name: precision
    value: true_positives / (true_positives + false_positives)
  - name: f1_score
    value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
```
The evaluation tracks:

- True positives, false positives, and false negatives, counted via named assertion metrics
- Precision and F1 score, computed as derived metrics from those counts
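The derived-metric arithmetic can be checked with a short script. The counts below are hypothetical; `recall` is included for completeness even though the config only derives precision and F1.

```python
def derived_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from assertion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Same formula as the f1_score derived metric above:
    # 2*TP / (2*TP + FP + FN) is algebraically equal to
    # 2 * precision * recall / (precision + recall).
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1_score": f1}

# Hypothetical counts for a 100-review run.
print(derived_metrics(tp=40, fp=10, fn=5))
```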