Back to Promptfoo

eval-conversation-relevance (Conversation Relevance)

examples/eval-conversation-relevance/README.md

0.121.92.8 KB
Original Source

eval-conversation-relevance (Conversation Relevance)

You can run this example with:

bash
npx promptfoo@latest init --example eval-conversation-relevance
cd eval-conversation-relevance

This example demonstrates how to use the conversation-relevance assertion to evaluate whether chatbot responses remain relevant throughout a conversation.

What is Conversation Relevance?

The conversation relevance metric evaluates whether each response in a conversation is relevant to the context and previous messages. It uses a sliding window approach to analyze conversation segments.

Running the Example

  1. Install dependencies:

    bash
    npm install -g promptfoo
    
  2. Set your OpenAI API key:

    bash
    export OPENAI_API_KEY=your-api-key
    
  3. Run the evaluation:

    bash
    promptfoo eval
    

Example Test Cases

1. Single-turn Evaluation

Tests basic relevance for a single query-response pair about travel to Paris.

2. Multi-turn Travel Conversation

Evaluates a complete conversation about travel planning where all responses should be relevant.

3. Conversation with Irrelevant Response

Demonstrates detection of an off-topic response (stock market comment) in the middle of a conversation about wedding planning.

4. Technical Support Conversation

Shows a high-quality technical support conversation with a high relevance threshold (0.95).

Configuration Options

  • threshold: Minimum score required to pass (0-1)
  • config.windowSize: Number of messages in each sliding window (default: 5)
  • provider: Override the default grading model

Interpreting Results

  • Score: Proportion of conversation windows deemed relevant
  • Pass/Fail: Based on whether the score meets the threshold
  • Reason: Explanation when responses are found irrelevant

Tips

  1. Use lower thresholds (0.7-0.8) for general conversations
  2. Use higher thresholds (0.9-0.95) for specialized domains like technical support
  3. Adjust window size based on conversation complexity
  4. Consider using more capable models (GPT-4) for grading complex conversations

How Scoring Works

The metric evaluates each message position using a sliding window approach. For example, with a 5-message conversation and window size of 3:

  • Window 1: Message 1 only (evaluates if Response 1 is relevant)
  • Window 2: Messages 1-2 (evaluates if Response 2 is relevant given context)
  • Window 3: Messages 1-3 (evaluates if Response 3 is relevant given context)
  • Window 4: Messages 2-4 (evaluates if Response 4 is relevant given context)
  • Window 5: Messages 3-5 (evaluates if Response 5 is relevant given context)

Each window evaluates whether the LAST assistant response in that window is relevant. The final score is:

text
Score = Number of Relevant Windows / Total Number of Windows