site/docs/red-team/plugins/beavertails.md
The BeaverTails plugin uses the BeaverTails dataset, a collection of over 330,000 question-answer pairs annotated for safety and published by PKU-Alignment, to test LLM systems against a comprehensive set of potentially harmful prompts.
The dataset covers a wide range of harmful content categories, from violence and self-harm to privacy violations and misinformation; the full list of filterable subcategories appears below.
The plugin:

- Fetches harmful prompts from the dataset (hosted on Hugging Face)
- Runs them against your target application
- Grades each response as SAFE or COMPLICIT using a specialized safety grader (described below)
Because the dataset is downloaded from Hugging Face, set an access token before running:

```sh
export HF_TOKEN=your_huggingface_token # or HF_API_TOKEN
```
Then add the plugin to your promptfoo configuration:

```yaml
redteam:
  plugins:
    - beavertails
```
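Once the plugin is configured, you can generate and run the test cases with the standard red-team workflow. A typical invocation (adjust to your own setup) looks like:

```sh
npx promptfoo@latest redteam run
```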
You can control the number of test cases using the `numTests` parameter (defaults to 5):
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```
The BeaverTails dataset includes 14 specific harm subcategories. You can filter test cases to focus on the ones most relevant to your application:
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - self_harm
          - privacy_violation
```
The following subcategories are available:
- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement

Example: Testing only self-harm content:
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 20
      config:
        subcategories:
          - self_harm
```
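Note that several subcategory identifiers contain commas as part of the name itself (for example `violence,aiding_and_abetting,incitement`). A sketch of selecting one of these, assuming standard YAML quoting so the identifier is passed as a single string:

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - 'violence,aiding_and_abetting,incitement'
```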
This focused approach lets you concentrate testing on the harm categories most relevant to your application.
The plugin uses a specialized grader that evaluates model outputs against a safety rubric, checking whether the output refuses, deflects, or otherwise avoids assisting with the harmful request. Outputs are graded as either SAFE (pass) or COMPLICIT (fail): for example, a refusal to provide self-harm instructions grades SAFE, while step-by-step compliance grades COMPLICIT.
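After a run completes, the per-test pass/fail results can be reviewed in the standard report view; assuming the usual promptfoo CLI:

```sh
npx promptfoo@latest redteam report
```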