# redteam-beavertails
This example demonstrates how to use the BeaverTails plugin for red team testing, including subcategory filtering.
The BeaverTails plugin uses the BeaverTails dataset (roughly 330k question-answer pairs annotated across 14 harm categories) to test AI safety boundaries. Subcategory filtering lets you focus on the harm categories most relevant to your application.
You can run this example with:

```sh
npx promptfoo@latest init --example redteam-beavertails
cd redteam-beavertails
```
This example includes two configuration files:

- `promptfooconfig.yaml` - Basic evaluation using the HuggingFace dataset directly
- `promptfooconfig-subcategories.yaml` - Red team generation with subcategory filtering

Set up your HuggingFace API token (required for both configs):

```sh
export HF_TOKEN=your_huggingface_token
```

Set up your provider API key:

```sh
export OPENAI_API_KEY=your_openai_api_key
```
Run the basic evaluation using the BeaverTails-Evaluation dataset:

```sh
promptfoo eval
```
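For reference, here is a minimal sketch of a config that loads test cases directly from the BeaverTails-Evaluation dataset on HuggingFace. The description, prompt template, and provider are illustrative assumptions, not necessarily the contents of the shipped `promptfooconfig.yaml`:

```yaml
# Illustrative sketch only; assumes the dataset exposes a `prompt` column.
description: BeaverTails basic evaluation (illustrative)
prompts:
  - '{{prompt}}'
providers:
  - openai:gpt-4o-mini # substitute your own provider
tests: huggingface://datasets/PKU-Alignment/BeaverTails-Evaluation
```

With a config like this, each dataset row becomes a test case and its `prompt` column is injected into the prompt template.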
View results:

```sh
promptfoo view
```
Generate test cases with subcategory filtering:

```sh
promptfoo redteam generate -c promptfooconfig-subcategories.yaml
```
This will create a redteam.yaml file with filtered test cases.
Then evaluate your model:

```sh
promptfoo eval
```

View results:

```sh
promptfoo view
```
The `promptfooconfig-subcategories.yaml` configuration shows two approaches:
Basic usage - Tests all harmful categories:

```yaml
- id: beavertails
  numTests: 5
```
Filtered usage - Tests only specific subcategories:

```yaml
- id: beavertails
  numTests: 5
  config:
    subcategories:
      - self_harm
      - privacy_violation
```
Available subcategories:

- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement
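For context, a sketch of how a filtered plugin entry sits inside a complete red team config. The `purpose` string and the subcategory choices below are illustrative assumptions; note that several subcategory keys contain commas as part of their names:

```yaml
# Illustrative sketch: combine any of the subcategory keys listed above.
redteam:
  purpose: A customer support chatbot for a retail bank
  plugins:
    - id: beavertails
      numTests: 5
      config:
        subcategories:
          - privacy_violation
          - financial_crime,property_crime,theft
```

Scoping the subcategories to your application's actual risk surface keeps generated test cases focused instead of spreading a small `numTests` budget across all 14 categories.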