# redteam-beavertails
This example demonstrates how to use the BeaverTails plugin for red team testing, including subcategory filtering.
The BeaverTails plugin uses the BeaverTails dataset (roughly 330k question-answer pairs annotated across 14 harm categories) to test AI safety boundaries. Subcategory filtering lets you focus on the harm categories most relevant to your application.
You can run this example with:

```sh
npx promptfoo@latest init --example redteam-beavertails
cd redteam-beavertails
```
This example includes two configuration files:

- `promptfooconfig.yaml` - Basic evaluation using the HuggingFace dataset directly
- `promptfooconfig-subcategories.yaml` - Red team generation with subcategory filtering

Set up your HuggingFace API token (required for both configs):

```sh
export HF_TOKEN=your_huggingface_token
```

Set up your provider API key:

```sh
export OPENAI_API_KEY=your_openai_api_key
```
Run the basic evaluation using the BeaverTails-Evaluation dataset:

```sh
promptfoo eval
```
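For reference, here is a minimal sketch of a config that loads test cases directly from the BeaverTails-Evaluation dataset on HuggingFace. The description, prompt template, and provider are illustrative assumptions, not necessarily the contents of the shipped `promptfooconfig.yaml`:

```yaml
# Illustrative sketch only; assumes the dataset exposes a `prompt` column.
description: BeaverTails basic evaluation (illustrative)
prompts:
  - '{{prompt}}'
providers:
  - openai:gpt-4o-mini # substitute your own provider
tests: huggingface://datasets/PKU-Alignment/BeaverTails-Evaluation
```

With a config like this, each dataset row becomes a test case and its `prompt` column is injected into the prompt template.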
View results:

```sh
promptfoo view
```
Generate test cases with subcategory filtering:

```sh
promptfoo redteam generate -c promptfooconfig-subcategories.yaml
```
This will create a redteam.yaml file with filtered test cases.
Then evaluate your model:

```sh
promptfoo eval
```

View results:

```sh
promptfoo view
```
The `promptfooconfig-subcategories.yaml` configuration shows two approaches:
Basic usage - Tests all harmful categories:

```yaml
- id: beavertails
  numTests: 5
```
Filtered usage - Tests only specific subcategories:

```yaml
- id: beavertails
  numTests: 5
  config:
    subcategories:
      - self_harm
      - privacy_violation
```
Available subcategories:

- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement
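For context, a sketch of how a filtered plugin entry sits inside a complete red team config. The `purpose` string and the subcategory choices below are illustrative assumptions; note that several subcategory keys contain commas as part of their names:

```yaml
# Illustrative sketch: combine any of the subcategory keys listed above.
redteam:
  purpose: A customer support chatbot for a retail bank
  plugins:
    - id: beavertails
      numTests: 5
      config:
        subcategories:
          - privacy_violation
          - financial_crime,property_crime,theft
```

Scoping the subcategories to your application's actual risk surface keeps generated test cases focused instead of spreading a small `numTests` budget across all 14 categories.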