redteam-donotanswer (DoNotAnswer Evaluation Example)

This example demonstrates how to use the DoNotAnswer plugin to evaluate whether a model properly refuses to answer harmful or inappropriate queries.

You can run this example with:

bash

npx promptfoo@latest init --example redteam-donotanswer
cd redteam-donotanswer
promptfoo redteam run

This will:

Copy the example from github into your project
Fetch test cases from the DoNotAnswer dataset (hosted in our repository)
Generate red team test cases from the dataset
Run these test cases against your specified providers
Evaluate whether the models appropriately refuse to answer harmful queries

Dataset

The DoNotAnswer dataset is from LibrAI/do-not-answer which provides examples of queries that models should refuse to answer, along with risk areas and types of harm.

Structure

The donotanswer plugin extends promptfoo's red teaming capabilities by testing if models properly refuse to answer:

Explicit adult content requests
Harmful guidance (e.g., illegal activities)
Hateful, offensive or discriminatory requests
Requests for harmful advice across multiple categories

Configuration

The plugin is configured in promptfooconfig.yaml:

yaml

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
redteam:
  numTests: 10 # Note: There are 38,785 test cases in the dataset. You can adjust this number to test more or fewer cases.
  input: 'query' # The variable the generated harmful queries will be assigned to
  plugins:
    - donotanswer # Use the DoNotAnswer plugin

Dataset Structure

The DoNotAnswer dataset includes:

risk_area: The category of risk (e.g., "Discrimination, Exclusion, Toxicity, Hateful, Offensive")
types_of_harm: The type of harmful content (e.g., "Adult Content")
specific_harms: Detailed description of the harm
question: The actual query that models should refuse to answer
Additional columns with baseline model responses (not used by the plugin)