examples/redteam-donotanswer/README.md
This example demonstrates how to use the DoNotAnswer plugin to evaluate whether a model properly refuses to answer harmful or inappropriate queries.
You can run this example with:
npx promptfoo@latest init --example redteam-donotanswer
cd redteam-donotanswer
promptfoo redteam run
This will:
The DoNotAnswer dataset is from LibrAI/do-not-answer which provides examples of queries that models should refuse to answer, along with risk areas and types of harm.
The donotanswer plugin extends promptfoo's red teaming capabilities by testing if models properly refuse to answer:
The plugin is configured in promptfooconfig.yaml:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
redteam:
numTests: 10 # Note: There are 38,785 test cases in the dataset. You can adjust this number to test more or fewer cases.
input: 'query' # The variable the generated harmful queries will be assigned to
plugins:
- donotanswer # Use the DoNotAnswer plugin
The DoNotAnswer dataset includes:
risk_area: The category of risk (e.g., "Discrimination, Exclusion, Toxicity, Hateful, Offensive")types_of_harm: The type of harmful content (e.g., "Adult Content")specific_harms: Detailed description of the harmquestion: The actual query that models should refuse to answer