Back to Promptfoo

redteam-beavertails (BeaverTails Red Team Testing)

examples/redteam-beavertails/README.md

0.121.93.2 KB
Original Source

redteam-beavertails (BeaverTails Red Team Testing)

This example demonstrates how to use the BeaverTails plugin for red team testing, including subcategory filtering.

Overview

The BeaverTails plugin uses the BeaverTails dataset (330k harmful prompts) to test AI safety boundaries. The subcategory filtering feature allows you to focus on specific harm categories relevant to your application.

Quick Start

You can run this example with:

bash
npx promptfoo@latest init --example redteam-beavertails
cd redteam-beavertails

Configuration Files

This example includes two configuration files:

  1. promptfooconfig.yaml - Basic evaluation using HuggingFace dataset directly
  2. promptfooconfig-subcategories.yaml - Red team generation with subcategory filtering

Setup

  1. Set up your HuggingFace API token (required for both configs):

    bash
    export HF_TOKEN=your_huggingface_token
    
  2. Set up your provider API key:

    bash
    export OPENAI_API_KEY=your_openai_api_key
    

Usage

Basic Evaluation

Run the basic evaluation using the BeaverTails-Evaluation dataset:

bash
promptfoo eval

View results:

bash
promptfoo view

Red Team with Subcategory Filtering

Generate test cases with subcategory filtering:

bash
promptfoo redteam generate -c promptfooconfig-subcategories.yaml

This will create a redteam.yaml file with filtered test cases.

Then evaluate your model:

bash
promptfoo eval

View results:

bash
promptfoo view

Subcategory Filtering

The promptfooconfig-subcategories.yaml configuration shows two approaches:

  1. Basic usage - Tests all harmful categories:

    yaml
    - id: beavertails
      numTests: 5
    
  2. Filtered usage - Tests only specific subcategories:

    yaml
    - id: beavertails
      numTests: 5
      config:
        subcategories:
          - self_harm
          - privacy_violation
    

Available Subcategories

  • animal_abuse - Cruelty, harm, or distress to animals
  • child_abuse - Physical, emotional, or sexual abuse of minors
  • controversial_topics,politics - Sensitive political or social issues
  • discrimination,stereotype,injustice - Unfair treatment based on characteristics
  • drug_abuse,weapons,banned_substance - Illegal substance or weapon misuse
  • financial_crime,property_crime,theft - Fraud, embezzlement, theft
  • hate_speech,offensive_language - Hostile communication targeting groups
  • misinformation_regarding_ethics,laws_and_safety - False information about critical topics
  • non_violent_unethical_behavior - Morally wrong actions without violence
  • privacy_violation - Unauthorized access or disclosure of personal information
  • self_harm - Self-inflicted harm or suicidal content
  • sexually_explicit,adult_content - Inappropriate sexual material
  • terrorism,organized_crime - Terrorist or organized crime content
  • violence,aiding_and_abetting,incitement - Physical harm or encouragement

Learn More