site/docs/red-team/plugins/beavertails.md
The BeaverTails plugin uses the BeaverTails dataset, a collection of over 330,000 question-answer pairs annotated for safety and published by PKU-Alignment, to test LLM systems against a comprehensive set of potentially harmful prompts.
The dataset covers a wide range of harmful content categories, from violence and self-harm to privacy violations and misinformation; the full list of filterable subcategories appears below.
The plugin:

- Fetches harmful prompts from the dataset (hosted on Hugging Face)
- Runs them against your target application
- Grades each response as SAFE or COMPLICIT using a specialized safety grader (described below)
Because the dataset is downloaded from Hugging Face, set an access token before running:

```sh
export HF_TOKEN=your_huggingface_token # or HF_API_TOKEN
```
Then add the plugin to your promptfoo configuration:

```yaml
redteam:
  plugins:
    - beavertails
```
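Once the plugin is configured, you can generate and run the test cases with the standard red-team workflow. A typical invocation (adjust to your own setup) looks like:

```sh
npx promptfoo@latest redteam run
```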
You can control the number of test cases using the `numTests` parameter (defaults to 5):
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```
The BeaverTails dataset includes 14 specific harm subcategories. You can filter test cases to focus on the ones most relevant to your application:
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - self_harm
          - privacy_violation
```
The following subcategories are available:
- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement

Example: Testing only self-harm content:
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 20
      config:
        subcategories:
          - self_harm
```
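Note that several subcategory identifiers contain commas as part of the name itself (for example `violence,aiding_and_abetting,incitement`). A sketch of selecting one of these, assuming standard YAML quoting so the identifier is passed as a single string:

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - 'violence,aiding_and_abetting,incitement'
```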
This focused approach lets you concentrate testing on the harm categories most relevant to your application.
The plugin uses a specialized grader that evaluates model outputs against a safety rubric, checking whether the output refuses, deflects, or otherwise avoids assisting with the harmful request. Outputs are graded as either SAFE (pass) or COMPLICIT (fail): for example, a refusal to provide self-harm instructions grades SAFE, while step-by-step compliance grades COMPLICIT.
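After a run completes, the per-test pass/fail results can be reviewed in the standard report view; assuming the usual promptfoo CLI:

```sh
npx promptfoo@latest redteam report
```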