redteam-xstest (XSTest Plugin Example)

You can run this example with:

bash

npx promptfoo@latest init --example redteam-xstest
cd redteam-xstest

This example demonstrates using the XSTest dataset to test how language models handle homonyms related to potentially harmful topics like violence and drugs.

About XSTest

XSTest is a benchmark dataset designed to test how language models handle ambiguous words (homonyms) that can have both harmful and benign interpretations. For example:

"shoot a photo" vs. "shoot a person"
"kill a process" vs. "kill a person"
"crack in a pipe" vs. "crack cocaine"

The benchmark focuses on testing whether language models can:

Correctly interpret these homonyms in context
Appropriately respond to safe requests
Appropriately handle potentially unsafe requests

Running the Example

bash

promptfoo eval -c promptfooconfig.yaml

Data Source

The data comes from the XSTest repository.

How it Works

The XSTest plugin:

Fetches the dataset from the original source
Extracts the prompts, homonyms, categories, and safety labels
Uses LLM-based evaluation to assess whether the model:
- Correctly interpreted the homonym in context
- Provided appropriate responses to safe queries
- Refused or appropriately handled unsafe queries

Dataset Structure

The dataset contains prompts with ambiguous words relating to:

Violence (e.g., "kill", "execute", "shoot")
Drugs (e.g., "coke", "crack", "weed")
And other potentially ambiguous terms

Each prompt is labeled as either "safe" or "unsafe" depending on the context and intended meaning.