eval-f-score (F-Score HuggingFace Dataset Sentiment Analysis Eval)

examples/eval-f-score/README.md

You can run this example with:

bash
npx promptfoo@latest init --example eval-f-score
cd eval-f-score

This project evaluates GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis using promptfoo. Each model response includes:

  • Sentiment classification
  • Confidence score (1-10)
  • Reasoning for the classification
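
As a rough illustration of that response shape, the sketch below validates one response in Python. It assumes the model replies with a JSON object; the `sentiment` field matches the assertion used later in this README, while the `confidence` and `reasoning` key names and the `parse_response` helper are hypothetical.

```python
import json


def parse_response(raw):
    """Parse a JSON model response and check the three expected fields.

    The key names `confidence` and `reasoning` are assumptions for
    illustration; only `sentiment` appears in the example's assertions.
    """
    data = json.loads(raw)
    if data.get("sentiment") not in ("positive", "negative"):
        raise ValueError("sentiment must be 'positive' or 'negative'")
    confidence = data.get("confidence")
    if not isinstance(confidence, int) or not 1 <= confidence <= 10:
        raise ValueError("confidence must be an integer from 1 to 10")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data


example = '{"sentiment": "positive", "confidence": 8, "reasoning": "Praises the cast."}'
print(parse_response(example)["sentiment"])
```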

Quick Start

Set your OpenAI API key (the OPENAI_API_KEY environment variable) and run the evaluation:

bash
promptfoo eval

Dataset

The evaluation uses the IMDB dataset from HuggingFace's datasets library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

  • text: The movie review content
  • sentiment: The label ("positive" or "negative")

To modify the sample size or generate a new dataset, you can use prepare_data.py. First, install the Python dependencies:

bash
pip install -r requirements.txt

Then run the preparation script:

bash
python prepare_data.py
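
The contents of prepare_data.py are not reproduced in this README. As a minimal sketch of the preprocessing described above, the following assumes the Hugging Face `imdb` dataset, whose records carry a `text` field and an integer `label` (0 for negative, 1 for positive). The function names, the seed, the choice of the test split, and the output filename `imdb_sample.csv` are illustrative assumptions, not taken from the actual script.

```python
import csv
import random

# Label encoding used by the IMDB dataset on Hugging Face.
LABEL_NAMES = {0: "negative", 1: "positive"}


def sample_rows(records, n=100, seed=42):
    """Pick up to n records at random and map them to the two CSV columns."""
    records = list(records)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(records, min(n, len(records)))
    return [
        {"text": r["text"], "sentiment": LABEL_NAMES[r["label"]]}
        for r in chosen
    ]


def write_csv(rows, path):
    """Write rows to a CSV file with `text` and `sentiment` columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
        writer.writeheader()
        writer.writerows(rows)


def main():
    """Download IMDB and write a 100-review sample (hypothetical entry point)."""
    # Imported here so the helpers above work without the dependency installed.
    from datasets import load_dataset

    test_split = load_dataset("imdb", split="test")
    write_csv(sample_rows(test_split, n=100), "imdb_sample.csv")
```

Calling `main()` would fetch the dataset on first use; changing `n` adjusts the sample size.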

Metrics Overview

The evaluation implements F-score and related metrics using promptfoo's assertion system:

  1. Base Metrics calculated for each test case using JavaScript assertions:
yaml
- type: javascript
  value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
  metric: true_positives
  2. Derived Metrics calculated from base metrics after the evaluation completes:
yaml
- name: precision
  value: true_positives / (true_positives + false_positives)

- name: f1_score
  value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
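
To make the two stages concrete, here is a small Python sketch of the first one: each test case contributes to exactly one of the four base counters, and the counters are summed across the run. It mirrors the logic of the JavaScript assertion rather than promptfoo's internals; the helper names `base_metric` and `tally` are hypothetical, and "positive" is treated as the positive class.

```python
from collections import Counter


def base_metric(predicted, expected, positive="positive"):
    """Name the confusion-matrix cell a single test case falls into."""
    if predicted == positive:
        return "true_positives" if expected == positive else "false_positives"
    return "false_negatives" if expected == positive else "true_negatives"


def tally(pairs):
    """Sum the per-test-case indicators into one count per base metric."""
    return Counter(base_metric(p, e) for p, e in pairs)


# Made-up (predicted, expected) pairs for illustration only.
pairs = [
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "positive"),
    ("negative", "negative"),
    ("positive", "positive"),
]
counts = tally(pairs)
print(dict(counts))
```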

The evaluation tracks:

  • True/False Positives/Negatives: Base metrics for classification
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1 Score: 2 × (precision × recall) / (precision + recall)
  • Accuracy: (TP + TN) / Total
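
The formulas above can be checked with a short Python sketch. The counts passed in are invented for illustration and are not results from this evaluation; the zero-denominator guards are an added assumption and not something the derived-metric expressions above include. The F1 line uses the count-based form from the YAML snippet, which is algebraically the same as 2 × (precision × recall) / (precision + recall).

```python
def derived_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from the four base counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Count-based F1, equivalent to the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for a 100-review run.
m = derived_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```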